Extracting a Million-Record Dataset from Historical NYC City Directories

Stephen Balogh, New York University
Nicholas Wolf, New York University
Bert Spaan, New York Public Library
NYU English Department, November 8th, 2017
What is a City Directory?
The NYPL has digitized more than 100 volumes: 1790–1920

NewYorkScapes Archives + Data Projects, 2014-2017


  • Municipal Archives Bodies in Transit Records
  • Emigrant Savings Bank Test Account Books
  • Fales Library Digital Downtown
  • NYC NYPL Business Directories

For more see newyorkscapes.org/project

Wilson Business Directory, 1852-53

New York City Directories: Potentials and Pitfalls

  • Provide attestations of addresses, including those not on maps
  • Occupations and names for the peopled landscape of the city
  • Additional attributes such as widowhood of females, marking of "colored"-ness
  • Chronological breadth: ~1786-1801, 1849-1922

NYU + NYPL Space/Time: Goals and Objectives

  • Build full digital-image-to-Space/Time-data workflow
  • Cleaning, ordering, and ingesting "At a Distance"
  • Replicable workflows for other cities' directories
  • Maximize accuracy with minimal human correction

Tesseract and OCR

Doggett & Rode 1851/52 Directory

  • Tesseract 3 (with training) ➔ Tesseract 4 (without training): 10-15% improvement
  • OCR error rate on a 5% sample with Tesseract 4: 11% (standard deviation 0.07 across post-1849 directories)

A Typology of OCR Errors and Uncertainties

Type | Percent of Sample | Remedy
Blacklist character(s) | 0.94% | Match string at a distance and swap offending character
No whitelist characters | 4.08% | Discard(?)
Words not in name/occ/street dictionaries | 2.2% | Crowdsource corrections (if any)
Remainders | 7.8% | Sort errors/non-errors and match leftovers at a distance
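
Several of these remedies rely on fuzzy ("at a distance") string matching against dictionaries. A minimal sketch in Python using the standard library's difflib (the street dictionary here is a hypothetical placeholder):

import difflib

# Hypothetical dictionary of known street names; in practice this
#   would be a large dictionary
streets = ['Suffolk', 'Ridge', 'Centre', 'James', 'Bowery', 'Broadway']

def match_at_a_distance(token, dictionary, cutoff=0.8):
    # Return the closest dictionary entry to an OCR'd token, or None
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_at_a_distance('Sufolk', streets))    # Suffolk
print(match_at_a_distance('Br0adway', streets))  # Broadway

Matching OCR'd tokens against a dictionary at a distance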

Goal

From OCR'd text:


Hoffmann Max, physician, 70 Suffolk
Hoffmann Paul, clerk, h 61 Ridge
Hoffmaster Charles A. carver, 220 Centre, h Westchester
Hogan Catharine, wid. John, teacher, 28 James, h 55 James

To a searchable map:


<div class='ocr_page'>
  <p class='ocr_par'>
    <span class='ocr_line' title="bbox 18 679 641 712;">
      Hoffmann Max, physician, 70 Suffolk
    </span>
    <span class='ocr_line' title="bbox 19 712 569 745;">
      Hoffmann Paul, clerk, h 61 Ridge
    </span>
    <span class='ocr_line' title="bbox 21 1200 828 1234;">
      Hoffmaster Charles A. carver, 220 Centre, h West-
    </span>
    <span class='ocr_line' title="bbox 84 1234 562 1259;">
      chester
    </span>
    <span class='ocr_line' title="bbox 862 619 1675 655;">
      Hogan Catharine, wid. John, teacher, 28 James, h
     </span>
     <span class='ocr_line' title="bbox 927 653 1181 679;">
      55 James
     </span>
  </p>
</div>
Tesseract outputs hOCR: an XML-based format for OCR output
Visualization of bounding boxes of lines from hOCR
  1. Remove headers, page numbers, etc.;
  2. Connect indented lines

Turn hOCR lines into city directory entries
Histogram of X position of hOCR lines
Simple Statistics: statistical methods for JavaScript

const stats = require('simple-statistics')

const xs = page.lines.map((line) => line.properties.bbox[0])

// One indent per column, and four clusters for
//   other stuff like page numbers and headings
const columnCount = 2
const clusterCount = columnCount * 2 + 4

// Find clusters of x coordinates, sort by size
const clusters = stats
  .ckmeans(xs, clusterCount)
  .sort((a, b) => b.length - a.length)

// Compute mode of clusters, and use only
//   the clusters we need
const columns = clusters
  .slice(0, columnCount)
  .map(stats.mode)

console.log(columns)
// [ 34, 840 ]
Finding columns with Simple Statistics

For all lines that fall outside of the two columns:

  1. Find the closest line in the upper-left direction ↖
  2. Connect them if that line is inside a column ⟷
Connecting indented lines
Finding the closest line with a 2D spatial index
RBush: R-tree-based 2D spatial index for JavaScript

const rbush = require('rbush')

// Create an Rbush R-tree index of all line bounding boxes:
const tree = rbush(page.lines.length)
tree.load(page.lines.map((line, index) => {
  const x = line.properties.bbox[0], y = line.properties.bbox[1]
  return {minX: x, minY: y, maxX: x, maxY: y, index}
}))
Create a 2D spatial index of all lines with RBush

const knn = require('rbush-knn')

page.lines
  // Only process lines that are not in one of the columns
  .filter((line) => line.columnIndex === undefined)
  .forEach((line) => {
    const lineX = line.properties.bbox[0]
    const lineY = line.properties.bbox[1]

    const filter = (item) => {
      const knnLine = page.lines[item.index]
      const knnLineX = knnLine.properties.bbox[0]
      const knnLineY = knnLine.properties.bbox[1]

      // Only consider lines that belong to a column (this also
      //   excludes the current line) and lie up and to the left
      return knnLine.columnIndex !== undefined &&
        knnLineX <= lineX && knnLineY <= lineY
    }

    // Find the closest 1 line to [lineX, lineY]
    const neighbors = knn(tree, lineX, lineY, 1, filter)

    // If neighbors.length > 0, we should connect previousLine + line
    const previousLine = neighbors[0]
  })
rbush-knn: k-nearest neighbors search for RBush
Detecting columns, connecting lines
Node.js module to turn hOCR into entries

From:


<div class='ocr_page'>
  <p class='ocr_par'>
    <span class='ocr_line' title="bbox 21 1200 828 1234;">
      Hoffmaster Charles A. carver, 220 Centre, h West-
    </span>
    <span class='ocr_line' title="bbox 84 1234 562 1259;">
      chester
    </span>
  </p>
</div>

To:


{"lines":
  [
    {"text": "Hoffmaster Charles A. carver, 220 Centre, h Westchester"}
  ]
}
Input and output of our Node.js module for hOCR files
  • How do we extract structured knowledge from a directory entry?
Problem
A sample directory sheet...
Find names of individuals and companies
Find occupations
Find predicates of space
Find addresses and locations
George Binkenburg, pianomkr. h. 924 Third av.
  1. name
    "George Binkenburg"
  2. occupation
    "pianomkr."
  3. location predicate
    "h." (home)
  4. address
    "924 Third av."
Four main types of labels
Can we do this using regular expressions? First intuition: Regular expressions
Not really!
  1. OCR transcriptions are never perfect
    • There are mistranscriptions in the parsed text
    • This makes it very difficult to consistently tokenize the entries
First intuition: Regular expressions
  • For instance, consider the (correctly) transcribed entry:
    Binkenburg George, pianomkr. h. 924 Third av.
    • From this, it seems reasonable to split the entry just based on "," and "." characters, and we'd be pretty close!
  • Binkenburg George | pianomkr | h | 924 Third av.
First intuition: Regular expressions
  • Unfortunately, we rarely see such clean parses!
  • Something like this is much more likely:
    • BInkenbu rg Geor ge pianomkr. h 924 Th1rd av
First intuition: Regular expressions
  • And not only can we not trust the transcriptions 100%...
  • We also can't predict the number of elements in an entry without inspecting it first!
    • Burtnett Daniel, pres. cit. f. ins. co. 67 Wall & 64 Bowery, h. 14 Perry
    • Burt Edwin C. boots & shoes, 43 Broadway, h. 327 Henry, Brooklyn
First intuition: Regular expressions
Can we do this using regular expressions, plus a ton of conditional statements? Second intuition: A lot of conditionals!
... and we tried exactly this!
Sort of!
  1. The approach is:
    • Since we can't assume that any delimiter is correctly parsed, we'll first just tokenize everything we can
      • Bush Anna, widow of Henry J. r. 152 W. 13th
      • ⇒ ["Bush", "Anna", ",", "widow", "of", "Henry", "J", " .", "r", ".", "152", "W", ".", "13th"]
Second intuition: A lot of conditionals!
  • Then, we go through the tokens one by one, and try to figure out what label they belong to
  • We can use heuristic methods for classifying each token
    • Does it contain a number? If so, it's probably part of an address.
    • Does it contain a known street name? If so, it's again part of an address.
    • Is it a known occupation? If so, it must be an occupation name.
    • ...
Second intuition: A lot of conditionals!
  • This turns out to be a somewhat promising approach, but it is very tedious to engineer...
  • Our accuracy is contingent on our ability to correctly:
    1. Think of all of the relevant heuristic methods
    2. Interpret and weigh the results of the heuristics correctly
Second intuition: a lot of conditionals!
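
To make this concrete, here is a minimal sketch of this style of heuristic token classifier, in Python (the dictionaries and label names are illustrative placeholders, not our actual code):

import re

# Hypothetical lookup tables; in practice these are large
#   name/occupation/street dictionaries
KNOWN_STREETS = {'Water', 'Hamilton', 'Perry', 'Wall', 'Bowery'}
KNOWN_OCCUPATIONS = {'pilot', 'carver', 'clerk', 'teacher', 'physician'}

def classify_token(token):
    # Guess a label for a single token using hand-written heuristics
    if re.search(r'\d', token):
        return 'address'      # contains a number: probably a house number
    if token in KNOWN_STREETS:
        return 'address'      # known street name
    if token in KNOWN_OCCUPATIONS:
        return 'occupation'
    if token in ('h', 'r'):
        return 'predicate'    # "h." (home), "r." (resides)
    if token[:1].isupper():
        return 'name'         # capitalized: possibly part of a name
    return 'unknown'

tokens = ['Canvin', 'John', 'pilot', '309', 'Water', 'h', '23', 'Hamilton']
print([(t, classify_token(t)) for t in tokens])

A sketch of a heuristic token classifier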
Can algorithms from machine learning do this better? Third intuition: Conditional random fields (CRF)
YES! 100% YES!
  1. We use a supervised-learning algorithm called "conditional random fields" (CRF)
    • Instead of trying to anticipate which factors are important for labeling, and then interpreting the results ourselves, we demonstrate by way of labeled examples and let CRF infer what is important
  2. Excellent blog post from NYTimes Developers on CRF
Third intuition: Conditional random fields (CRF)
  • First we create training examples, at the same individual-token level as before:
  • Each token is labeled in terms of what tag it is part of
CRF training data
Label | Abbreviation | Example
START | START | (beginning of entry)
END | END | (end of entry)
Delimiter | D | ",", "."
Name component | NC | "Alice"
Occupation component | OC | "laborer"
Position attribute | PA | "h."
Address component | AC | "718", "Washington"
CRF labels
  • This sequence of labels accompanying the tokens is used for training, and is also the form that CRF predictions take
  • Given a sequence of labels, we can then easily parse the record into discrete elements
    • START NC NC D OC OC D PA AC AC PA AC AC END
    • ⇒ name | occupation | address 1 | address 2
CRF label decoding
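
Decoding a predicted label sequence back into discrete fields is a simple grouping pass. A minimal sketch in Python (the label assignment is illustrative, and the grouping logic is our own sketch, not the project's exact code):

tokens = ['Bush', 'Anna', ',', 'widow', 'of', 'Henry', 'J', '.',
          'r', '.', '152', 'W', '.', '13th']
labels = ['NC', 'NC', 'D', 'OC', 'OC', 'OC', 'OC', 'D',
          'PA', 'D', 'AC', 'AC', 'D', 'AC']

def decode(tokens, labels):
    # Group labeled tokens into fields: skip delimiters, and merge
    #   consecutive runs of the same label that a delimiter splits
    fields = []
    for token, label in zip(tokens, labels):
        if label == 'D':
            continue
        if fields and fields[-1][0] == label:
            fields[-1][1].append(token)
        else:
            fields.append((label, [token]))
    return [(label, ' '.join(toks)) for label, toks in fields]

print(decode(tokens, labels))
# [('NC', 'Bush Anna'), ('OC', 'widow of Henry J'),
#  ('PA', 'r'), ('AC', '152 W 13th')]

Decoding a label sequence into discrete fields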
  • Conditional random fields are discriminative models that produce the sequence of labels which maximizes:
    P( labels | token features )
  • Features encode information relevant to a word's label in a way that is sensitive to the word itself, as well as to that word's position in the sequence
CRF parameter learning
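
For reference (the slide leaves the model form implicit), a linear-chain CRF defines this conditional probability as:

P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)

where each f_k is a feature function over adjacent labels and the token sequence, each \lambda_k is a learned weight, and Z(x) normalizes over all possible label sequences.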
  • Features are usually boolean (true/false) attributes that are computed for each token
    • In CRF, features describe not only the "present" token, but can also look at adjacent tokens
  • We have to create features similar to the heuristic functions we used earlier
CRF feature engineering
  • During training, the algorithm proceeds through the examples and finds the optimal feature weights that maximize the probability of the correct labels
    • (... this particular task is "convex", meaning there is only one optimum, so we can find it easily)
  • This process is done iteratively, often using a training algorithm called stochastic gradient descent
    • (... but luckily for us, we use a library that abstracts this away!)
CRF training
  • Assume that the input we are trying to classify is:
    • Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
CRF features
  • First we'll go through the input, one token at a time, and featurize
CRF features
  • Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
    • current_word-is_start_of_sentence: TRUE
    • current_word-is_end_of_sentence: FALSE
    • current_word-contains_a_number: FALSE
    • current_word-character_capitalized: TRUE
    • ...
    • next_word-is_start_of_sentence: FALSE
    • ...
    • previous_word-is_start_of_sentence: NULL
    • ...
CRF feature example, iteration 1
  • Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
    • current_word-is_start_of_sentence: FALSE
    • current_word-is_end_of_sentence: FALSE
    • current_word-contains_a_number: FALSE
    • current_word-character_capitalized: TRUE
    • ...
    • next_word-is_start_of_sentence: FALSE
    • ...
    • previous_word-is_start_of_sentence: TRUE
    • ...
CRF feature example, iteration 2
  • We have a lot of freedom in designing what features to extract
  • Features can represent many aspects of the input
    • Word shape
    • Suffixes and prefixes
    • Types of characters in the input
    • Whether or not the input is a known entity
CRF feature engineering
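
A minimal sketch in Python of such a featurizer, producing one feature dict per token in the format sklearn-crfsuite expects (the feature names here are illustrative):

def token_features(tokens, i):
    # Build a feature dict for token i; features may also look at
    #   adjacent tokens
    token = tokens[i]
    features = {
        'is_start_of_sentence': i == 0,
        'is_end_of_sentence': i == len(tokens) - 1,
        'contains_a_number': any(c.isdigit() for c in token),
        'first_character_capitalized': token[:1].isupper(),
        'suffix_3': token[-3:],
        # Word shape: "B'way" becomes "X'xxx"
        'word_shape': ''.join(
            'X' if c.isupper() else 'x' if c.islower()
            else 'd' if c.isdigit() else c
            for c in token),
    }
    if i > 0:
        features['previous_word'] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features['next_word'] = tokens[i + 1].lower()
    return features

tokens = ['Bush', 'Chauncey', ',', 'commissioner', ',', '157', "B'way"]
print(token_features(tokens, 6))

A sketch of a per-token feature extractor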
  • We used Python's sklearn-crfsuite package
    • It conforms to the popular scikit-learn package API
    • It's in Python, so it's straightforward to write custom feature extractors
CRF implementation
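
Putting the pieces together with sklearn-crfsuite, as a minimal sketch (the toy training data and hyperparameters are illustrative; token_features() is the featurizer sketched above):

import sklearn_crfsuite

# Toy training data: tokenized entries with one label per token
train_entries = [
    ['Canvin', 'John', ',', 'pilot', ',', '309', 'Water'],
    ['Hoffmann', 'Paul', ',', 'clerk', ',', 'h', '61', 'Ridge'],
]
train_labels = [
    ['NC', 'NC', 'D', 'OC', 'D', 'AC', 'AC'],
    ['NC', 'NC', 'D', 'OC', 'D', 'PA', 'AC', 'AC'],
]

# One list of feature dicts per entry
X_train = [[token_features(entry, i) for i in range(len(entry))]
           for entry in train_entries]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X_train, train_labels)

# Predict a label sequence for a new, tokenized entry
test = ['Hogan', 'Catharine', ',', 'teacher', ',', '28', 'James']
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))

Training and prediction with sklearn-crfsuite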
  • What does CRF allow us to do?
    • We just have to create features that are sensitive to differences of tokens
      • But, we don't actually have to determine if those differences are useful or not!
    • CRF infers from training examples (by perceiving in terms of feature vectors) what the most important factors in predicting labels are
CRF implementation
  • How good is CRF?
    • Super good! Really, really good.
      • But we don't really have many stats to back that claim up
    • Since the majority of our input is unlabeled, it's hard to evaluate against it directly
CRF accuracy
CRF label sample
  • You can find our city directory CRF module, as well as our training data, on GitHub
  • We're currently using only about 70 hand-labeled training examples, and this seems to work very well
  • Additional tweaks are mostly around edge-cases and infrequent types of statements
Implementation details
Assembling training data in a spreadsheet
Example in Python
John Canvin, pilot, 309 Water, h. 23 Hamilton

{
  "pageNum": 80,
  "bbox": [62, 473, 756, 505],
  "text": "Canvin John, pilot, 309 Water, h. 23 Hamilton",
  "parsed": {
    "subjects": ["Canvin John"],
    "occupations": ["pilot"],
    "addresses":[
      ["309 Water"],
      ["23 Hamilton","h"]
    ]
  }
}
A city directory entry as structured data
Finding 23 Hamilton Street...

Hamilton Street no longer exists...

William Perris, Maps of the city of New York (1857)
Knickerbocker Village, in Google Earth

We need a database of historical addresses!

Let's use NYPL's digitized maps!

Detail of Plate 3, Atlas of the Bronx (1921)
Map Warper, our website for crowdsourced map georectification
Cropping the image, keeping only cartographic material
Showing the rectified map on top of New York City today
Stitching all pages from the atlas together
Building Inspector: from maps to historical building and address data
Crowdsourced transcription of house numbers with Building Inspector
All Building Inspector data is available via an API
Our dataset of historical house numbers
A database of historical streets, traced from maps & atlases
Tracing all streets from an 1854 atlas of Manhattan
Our dataset of historical streets (you can help us trace!)
Extract data from different sources, convert to NDJSON
Open historical datasets on spacetime.nypl.org
Datasets from the NYC Space/Time Directory visualized with QGIS
  • 861 streets
  • 74,635 house numbers
New insights from combining historical datasets
house numbers + streets = addresses
  • 861 streets
  • 74,635 house numbers
  • 48,754 addresses
A simple geocoder for historical addresses, made with Lunr
Talman Street, no. 57, Brooklyn
Fifth ⟶ Fifth Avenue
Rivington ⟶ Rivington Street
Av B ⟶ Avenue B
W. 17th ⟶ West 17th Street
B'way ⟶ Broadway
Grace Ct. ⟶ Grace Court
Thirty-First ⟶ 31st Street
B'ry ⟶ Bowery
Street names from city directories need to be normalized
Node.js library to normalize NYC street names

const normalizer = require('@spacetime/nyc-street-normalizer')

const input = [
  "13th avenue", "Vernon Blvd", "E. 35th", "Macdougal",
  "Shore Rd", "B'way", "E. Twenty-Second St",
]

input.forEach((street) => console.log(street, '⟶', normalizer(street)))
//  13th avenue ⟶ Thirteenth Avenue
//  Vernon Blvd ⟶ Vernon Boulevard
//  E. 35th ⟶ East 35th Street
//  Macdougal ⟶ MacDougal Street
//  Shore Rd ⟶ Shore Road
//  B'way ⟶ Broadway
//  E. Twenty-Second St ⟶ East 22nd Street
Example code, showing street name normalization

Normalized address from city directory:


{
  "pageNum": 80,
  "bbox": [62, 473, 756, 505],
  "text": "Canvin John, pilot, 309 Water, h. 23 Hamilton",
  "normalizedAddress": "23 Hamilton Street"
}

Historical address, with coordinates:


{
  "name": "23 Hamilton Street",
  "data": {
    "mapId":7138,
    "borough":"Manhattan"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [-73.9955835, 40.7110962]
  }
}
Match normalized addresses from two datasets
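
The match itself can be an exact-string join on the normalized address. A minimal sketch in Python (field names follow the JSON examples above; the join logic is our illustration):

# Hypothetical in-memory datasets; the real datasets are much larger
historical = {
    '23 Hamilton Street': {
        'geometry': {'type': 'Point',
                     'coordinates': [-73.9955835, 40.7110962]},
    },
}

entries = [
    {'text': 'Canvin John, pilot, 309 Water, h. 23 Hamilton',
     'normalizedAddress': '23 Hamilton Street'},
]

# Attach coordinates to every directory entry whose normalized
#   address appears in the historical address dataset
for entry in entries:
    match = historical.get(entry['normalizedAddress'])
    if match:
        entry['geometry'] = match['geometry']

print(entries[0]['geometry']['coordinates'])
# [-73.9955835, 40.7110962]

Joining directory entries to historical addresses by normalized address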
  • 861 streets
  • 74,635 house numbers
  • 48,754 addresses
  • 3,133,624 city directory entries (28 volumes)
  • 967,482 geocoded entries
967,482 geocoded city directory entries
City directories, geocoded with historical address data
Occupations, counted
Names, counted

What's next?

Auto-digest register and trade weekly (1918)
Manhattan: 86th Street - Amsterdam Avenue (1918)
Locomobile “30” (1909)
The Locomobile Book (1909)

What's next?

Making all this data searchable and visible!

A new UI to search & visualize all NYC Space/Time Directory data

Thanks!

All the data will soon be available on spacetime.nypl.org

Slides:

spacetime.nypl.org/city-directory-meetup

Who wants a beer?