- Municipal Archives Bodies in Transit Records
- Emigrant Savings Bank Test Account Books
- Fales Library Digital Downtown
- NYC NYPL Business Directories
For more see
newyorkscapes.org/project
Wilson Business Directory, 1852-53
New York City Directories: Potentials and Pitfalls
- Provide attestations of addresses, including those not on maps
- Occupations and names for the peopled landscape of the city
- Additional attributes such as widowhood of females, marking of "colored"-ness
- Chronological breadth: ~1786-1801, 1849-1922
NYU + NYPL Space/Time: Goals and Objectives
- Build full digital-image-to-Space/Time-data workflow
- Cleaning, ordering, and ingesting "At a Distance"
- Replicable workflows for other cities' directories
- Maximize accuracy with minimal human correction
Tesseract and OCR
Doggett & Rode 1851/52 Directory
- Tesseract3 (with training) ➔ Tesseract4 (without training): 10-15% improvement
- OCR error rate on a 5% sample with Tesseract4: 11% (std. dev. 0.07 on post-1849 directories)
A Typology of OCR Errors and Uncertainties

| Type | Percent of Sample | Remedy |
| --- | --- | --- |
| Blacklist character(s) | 0.94% | Match string at a distance and swap offending character |
| No whitelist characters | 4.08% | Discard(?) |
| Words not in name/occ/street dictionaries | 2.2% | Crowdsource corrections (if any) |
| Remainders | 7.8% | Sort errors/non-errors and match leftovers at a distance |
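The "match at a distance" remedies above can be sketched with a plain Levenshtein edit distance. This is a minimal illustration, not the project's actual pipeline; the street list and distance threshold are made up:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_at_a_distance(token, dictionary, max_distance=2):
    """Return the closest dictionary word within max_distance, or None."""
    best = min(dictionary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else None

streets = ['Third', 'Suffolk', 'Ridge', 'Centre']
print(match_at_a_distance('Th1rd', streets))  # prints Third
```

A real pipeline would use an indexed fuzzy matcher rather than a linear scan, but the principle is the same: an OCR token that is close enough to a known string can be swapped for it.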
Goal
From OCR'd text:
Hoffmann Max, physician, 70 Suffolk
Hoffmann Paul, clerk, h 61 Ridge
Hoffmaster Charles A. carver, 220 Centre, h Westchester
Hogan Catharine, wid. John, teacher, 28 James, h 55 James
To a searchable map:
<div class='ocr_page'>
<p class='ocr_par'>
<span class='ocr_line' title="bbox 18 679 641 712;">
Hoffmann Max, physician, 70 Suffolk
</span>
<span class='ocr_line' title="bbox 19 712 569 745;">
Hoffmann Paul, clerk, h 61 Ridge
</span>
<span class='ocr_line' title="bbox 21 1200 828 1234;">
Hoffmaster Charles A. carver, 220 Centre, h West-
</span>
<span class='ocr_line' title="bbox 84 1234 562 1259;">
chester
</span>
<span class='ocr_line' title="bbox 862 619 1675 655;">
Hogan Catharine, wid. John, teacher, 28 James, h
</span>
<span class='ocr_line' title="bbox 927 653 1181 679;">
55 James
</span>
</p>
</div>
Tesseract outputs hOCR: an HTML-based markup format for OCR output
Visualization of bounding boxes of lines from hOCR
- Remove headers, page numbers, etc.;
- Connect indented lines
Turn hOCR lines into city directory entries
Histogram of X position of hOCR lines
const stats = require('simple-statistics')
const xs = page.lines.map((line) => line.properties.bbox[0])
// One indent per column, and four clusters for
// other stuff like page numbers and headings
const columnCount = 2
const clusterCount = columnCount * 2 + 4
// Find clusters of x coordinates, sort by size
const clusters = stats
.ckmeans(xs, clusterCount)
.sort((a, b) => b.length - a.length)
// Compute mode of clusters, and use only
// the clusters we need
const columns = clusters
.slice(0, columnCount)
.map(stats.mode)
console.log(columns)
// [ 34, 840 ]
Finding columns with Simple Statistics
For each line that falls outside the two columns:
- Find the closest line in the upper-left direction ↖
- Connect them if that line is inside a column ⟷
Connecting indented lines
Finding the closest line with a 2D spatial index
const rbush = require('rbush')
// Create an RBush R-tree index of all line bounding boxes
// (the optional constructor argument tunes max entries per node, not item count)
const tree = rbush()
tree.load(page.lines.map((line, index) => {
const x = line.properties.bbox[0], y = line.properties.bbox[1]
return {minX: x, minY: y, maxX: x, maxY: y, index}
}))
Create a 2D spatial index of all lines with RBush
const knn = require('rbush-knn')
page.lines
// Only process lines that are not in one of the columns
.filter((line) => line.columnIndex === undefined)
.forEach((line, lineIndex) => {
const lineX = line.properties.bbox[0]
const lineY = line.properties.bbox[1]
const filter = (item) => {
const knnLine = page.lines[item.index]
const knnLineX = knnLine.properties.bbox[0]
const knnLineY = knnLine.properties.bbox[1]
// Only accept lines that belong to a column and lie to the upper left
return lineIndex !== item.index &&
knnLine.columnIndex !== undefined &&
knnLineX <= lineX && knnLineY <= lineY
}
// Find the single closest line to [lineX, lineY]
const neighbors = knn(tree, lineX, lineY, 1, filter)
// If neighbors.length > 0, we should connect previousLine + line
const previousLine = neighbors[0]
})
rbush-knn: k-nearest neighbors search for RBush
From:
<div class='ocr_page'>
<p class='ocr_par'>
<span class='ocr_line' title="bbox 21 1200 828 1234;">
Hoffmaster Charles A. carver, 220 Centre, h West-
</span>
<span class='ocr_line' title="bbox 84 1234 562 1259;">
chester
</span>
</p>
</div>
To:
{"lines":
[
{"text": "Hoffmaster Charles A. carver, 220 Centre, h Westchester"}
]
}
Input and output of our Node.js module for hOCR files
- How do we extract structured knowledge from a directory entry?
Problem
A sample directory sheet...
Find names of individuals and companies
Find occupations
Find predicates of space
Find addresses and locations
George Binkenburg, pianomkr. h. 924 Third av.
- name
⇒ "George Binkenburg"
- occupation
⇒ "pianomkr."
- location predicate
⇒ "h." (home)
- address
⇒ "924 Third av."
Four main types of labels
Can we do this using regular expressions?
First intuition: Regular expressions
Not really!
- OCR transcriptions are never perfect
- There are mistranscriptions in the parsed text
- This makes it very difficult to consistently tokenize the entries
First intuition: Regular expressions
- For instance, consider the (correctly) transcribed entry:
Binkenburg George, pianomkr. h. 924 Third av.
- From this, it seems reasonable to split the entry just based on "," and "." characters, and we'd be pretty close!
- Binkenburg George, pianomkr. h. 924 Third av.
First intuition: Regular expressions
- Unfortunately, we rarely see such clean parses!
- Something like this is much more likely:
- BInkenbu rg Geor ge pianomkr. h 924 Th1rd av
First intuition: Regular expressions
- And not only can we not trust the transcriptions 100%...
- We also can't predict the number of elements in an entry without inspecting it first!
- Burtnett Daniel, pres. cit. f. ins. co. 67 Wall & 64 Bowery, h. 14 Perry
- Burt Edwin C. boots & shoes, 43 Broadway, h. 327 Henry, Brooklyn
First intuition: Regular expressions
Can we do this using regular expressions, plus a ton of conditional statements?
Second intuition: A lot of conditionals!
... and we tried exactly this!
Sort of!
- The approach is:
- Since we can't assume that any delimiter is correctly parsed, we'll first just tokenize everything we can
- Bush Anna, widow of Henry J. r. 152 W. 13th
- ⇒ ["Bush", "Anna", ",", "widow", "of", "Henry", "J", ".", "r", ".", "152", "W", ".", "13th"]
Second intuition: A lot of conditionals!
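The tokenize-everything step above can be sketched with a single regular expression that separates word characters from delimiter punctuation. A minimal sketch; a real tokenizer has to cope with more punctuation than just "," and ".":

```python
import re

def tokenize(entry):
    # Keep runs of letters/digits/apostrophes as tokens,
    # and emit each "," or "." as its own delimiter token
    return re.findall(r"[A-Za-z0-9']+|[.,]", entry)

print(tokenize('Bush Anna, widow of Henry J. r. 152 W. 13th'))
# ['Bush', 'Anna', ',', 'widow', 'of', 'Henry', 'J', '.',
#  'r', '.', '152', 'W', '.', '13th']
```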
- Then, we go through the tokens one by one, and try to figure out what label they belong to
- We can use heuristic methods for classifying each token
- Does it contain a number? If so, it's probably part of an address.
- Does it contain a known street name? If so, it's again part of an address.
- Is it a known occupation? If so, it must be an occupation name.
- ...
Second intuition: A lot of conditionals!
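The heuristics above can be sketched as a rule-based token classifier. The function name, labels, and tiny dictionaries here are stand-ins for the real name/occupation/street lists, not the project's actual code:

```python
def classify_token(token, occupations, streets):
    """Heuristic guess at which label a token belongs to."""
    if token in {',', '.'}:
        return 'delimiter'
    if any(ch.isdigit() for ch in token):
        return 'address'        # contains a number → probably a house number
    if token in streets:
        return 'address'        # known street name
    if token in occupations:
        return 'occupation'     # known occupation
    return 'unknown'            # falls through: needs more heuristics

occupations = {'pianomkr', 'laborer', 'clerk', 'teacher'}
streets = {'Third', 'Ridge', 'Suffolk'}
print(classify_token('924', occupations, streets))        # address
print(classify_token('pianomkr', occupations, streets))   # occupation
print(classify_token('Binkenburg', occupations, streets)) # unknown
```

The "unknown" bucket is exactly where this approach gets tedious: every residual case needs another hand-written rule, and the rules have to be weighed against each other.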
- This turns out to be a somewhat promising approach, but it is very tedious to engineer...
- Our accuracy is contingent on our ability to correctly:
- Think of all of the relevant heuristic methods
- Interpret and weigh the results of the heuristics correctly
Second intuition: a lot of conditionals!
Can algorithms from machine learning do this better?
Third intuition: Conditional random fields (CRF)
YES! 100% YES!
- We use a supervised-learning algorithm called "conditional random fields" (CRF)
- Instead of trying to anticipate which factors are important for labeling, and then interpreting the results ourselves, we demonstrate by way of labeled examples and let CRF infer what is important
- Excellent blog post from NYTimes Developers on CRF
Third intuition: Conditional random fields (CRF)
- First we create training examples, at the same individual-token level as before:
- Each token is labeled in terms of what tag it is part of
CRF training data
| Label | Abbreviation | Example |
| --- | --- | --- |
| START | START | (beginning of entry) |
| END | END | (end of entry) |
| Delimiter | D | ",", "." |
| Name component | NC | "Alice" |
| Occupation component | OC | "laborer" |
| Position attribute | PA | "h." |
| Address component | AC | "718", "Washington" |
CRF labels
- This sequence of labels accompanying the tokens is used for training, and also is the form that CRF predictions will take
- Given a sequence of labels, we can then easily parse the record into discrete elements
- START NC NC D OC OC D PA AC AC PA AC AC END
- ⇒ name, occupation, address 1, address 2
CRF label decoding
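The decoding step can be sketched by walking the predicted label sequence and grouping consecutive tokens into fields. A minimal illustration, not the project's actual decoder; note the rule that a PA label opens a new address group:

```python
FIELD = {'NC': 'name', 'OC': 'occupation', 'PA': 'address', 'AC': 'address'}

def decode(tokens, labels):
    """Group labeled tokens into parsed fields, skipping START/END/delimiters.
    A PA label starts a new address group (e.g. the 'h' in 'h. 23 Hamilton')."""
    fields = []
    for token, label in zip(tokens, labels):
        if label not in FIELD:          # skip START, END, and delimiters (D)
            continue
        field = FIELD[label]
        if not fields or fields[-1][0] != field or label == 'PA':
            fields.append([field, []])  # start a new field group
        fields[-1][1].append(token)
    return [(field, ' '.join(parts)) for field, parts in fields]

tokens = ['Binkenburg', 'George', ',', 'pianomkr', '.', 'h', '.', '924', 'Third', 'av']
labels = ['NC', 'NC', 'D', 'OC', 'D', 'PA', 'D', 'AC', 'AC', 'AC']
print(decode(tokens, labels))
# [('name', 'Binkenburg George'), ('occupation', 'pianomkr'), ('address', 'h 924 Third av')]
```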
- A conditional random field uses a discriminative model to produce the sequence of labels that maximizes:
P( label sequence | word features )
- Features encode information relevant to a word's label in a way that is sensitive to the word itself, as well as that word's location in a sequence
CRF parameter learning
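For a linear-chain CRF, this conditional probability takes the standard textbook form (generic notation, not specific to our setup):

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\Bigg( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Bigg)
```

Here the $f_k$ are the feature functions, the $\lambda_k$ are the weights learned during training, and $Z(x)$ normalizes over all possible label sequences.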
- Features are usually boolean (true/false) attributes that are computed for each token
- In CRF, features describe not only the "present" token, but can also look at adjacent tokens
- We have to create features similar to the heuristic functions we used earlier
CRF feature engineering
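The per-token features can be sketched as a dictionary-building function in the style that sklearn-crfsuite expects. The feature names here are illustrative, not our exact feature set:

```python
def token_features(tokens, i):
    """Features for token i, looking at the token itself and,
    when they exist, its immediate neighbors."""
    token = tokens[i]
    features = {
        'word.lower': token.lower(),
        'word.is_start_of_sentence': i == 0,
        'word.is_end_of_sentence': i == len(tokens) - 1,
        'word.contains_a_number': any(ch.isdigit() for ch in token),
        'word.is_capitalized': token[:1].isupper(),
    }
    if i > 0:
        features['prev_word.lower'] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features['next_word.lower'] = tokens[i + 1].lower()
    return features

tokens = ['Bush', 'Chauncey', ',', 'commissioner']
print(token_features(tokens, 0)['word.is_start_of_sentence'])  # True
```

A full entry is then featurized as `[token_features(tokens, i) for i in range(len(tokens))]`, one dictionary per token, which is the input shape sklearn-crfsuite consumes.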
- During training time, the algorithm will proceed through the examples, and find the optimal feature weights that maximize the probability of getting the correct label
- (... this particular objective is convex, meaning there is a single global optimum, so we can find it reliably)
- This process is done iteratively, often using a training algorithm called stochastic gradient descent
- (... but luckily for us, we use a library that abstracts this away!)
CRF training
- Assume that the input we are trying to classify is:
- Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
CRF features
- First we'll go through the input, one token at a time, and featurize
CRF features
- Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
- current_word-is_start_of_sentence: TRUE
- current_word-is_end_of_sentence: FALSE
- current_word-contains_a_number: FALSE
- current_word-character_capitalized: TRUE
- ...
- next_word-is_start_of_sentence: FALSE
- ...
- previous_word-is_start_of_sentence: NULL
- ...
CRF feature example, iteration 1
- Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
- current_word-is_start_of_sentence: FALSE
- current_word-is_end_of_sentence: FALSE
- current_word-contains_a_number: FALSE
- current_word-character_capitalized: TRUE
- ...
- next_word-is_start_of_sentence: FALSE
- ...
- previous_word-is_start_of_sentence: TRUE
- ...
CRF feature example, iteration 2
- We have a lot of freedom in designing what features to extract
- Features can represent many aspects of the input
- Word shape
- Suffixes and prefixes
- Types of characters in the input
- Whether or not the input is a known entity
CRF feature engineering
- We used Python's sklearn-crfsuite package
- It conforms to the popular scikit-learn package API
- It's in Python, so it's straightforward to write custom feature extractors
CRF implementation
- What does CRF allow us to do?
- We just have to create features that are sensitive to differences of tokens
- But, we don't actually have to determine if those differences are useful or not!
- CRF infers from training examples (by perceiving in terms of feature vectors) what the most important factors in predicting labels are
CRF implementation
- How good is CRF?
- Super good! Really, really good.
- But we don't really have many stats to back that claim up
- Since the majority of our input is unlabeled, it's hard to evaluate against it directly
CRF accuracy
- You can find our city directory CRF module, as well as our training data, on GitHub
- We're currently using only about 70 hand-labeled training examples, and this seems to work very well
- Additional tweaks are mostly around edge-cases and infrequent types of statements
Implementation details
Assembling training data in a spreadsheet
{
"pageNum": 80,
"bbox": [62, 473, 756, 505],
"text": "Canvin John, pilot, 309 Water, h. 23 Hamilton",
"parsed": {
"subjects": ["Canvin John"],
"occupations": ["pilot"],
"addresses":[
["309 Water"],
["23 Hamilton","h"]
]
}
}
A city directory entry as structured data
Finding 23 Hamilton Street...
Hamilton Street no longer exists...
We need a database of historical addresses!
Let's use NYPL's digitized maps!
Our dataset of historical house numbers
A database of historical streets, traced from maps & atlases
Extract data from different sources, convert to NDJSON
Datasets from the NYC Space/Time Directory visualized with QGIS
- 861 streets
- 74,635 house numbers
New insights from combining historical datasets
- 861 streets
- 74,635 house numbers
- 48,754 addresses
const normalizer = require('@spacetime/nyc-street-normalizer')
const input = [
"13th avenue", "Vernon Blvd", "E. 35th", "Macdougal",
"Shore Rd", "B'way", "E. Twenty-Second St",
]
input.forEach((street) => console.log(street, '⟶', normalizer(street)))
// 13th avenue ⟶ Thirteenth Avenue
// Vernon Blvd ⟶ Vernon Boulevard
// E. 35th ⟶ East 35th Street
// Macdougal ⟶ MacDougal Street
// Shore Rd ⟶ Shore Road
// B'way ⟶ Broadway
// E. Twenty-Second St ⟶ East 22nd Street
Example code, showing street name normalization
Normalized address from city directory:
{
"pageNum": 80,
"bbox": [62, 473, 756, 505],
"text": "Canvin John, pilot, 309 Water, h. 23 Hamilton",
"normalizedAddress": "23 Hamilton Street"
}
Historical address, with coordinates:
{
"name": "23 Hamilton Street",
"data": {
"mapId":7138,
"borough":"Manhattan"
},
"geometry": {
"type": "Point",
"coordinates": [-73.9955835, 40.7110962]
}
}
Match normalized addresses from two datasets
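With both sides normalized, the join comes down to an exact match on the address string. A minimal sketch with the one record from the slides; the function name and record shapes are illustrative:

```python
def geocode_entries(entries, historical_addresses):
    """Look up each directory entry's normalized address in the
    historical-address dataset and attach the matching geometry."""
    by_name = {addr['name']: addr for addr in historical_addresses}
    geocoded = []
    for entry in entries:
        match = by_name.get(entry['normalizedAddress'])
        if match:
            geocoded.append({**entry, 'geometry': match['geometry']})
    return geocoded

entries = [{
    'text': 'Canvin John, pilot, 309 Water, h. 23 Hamilton',
    'normalizedAddress': '23 Hamilton Street'
}]
addresses = [{
    'name': '23 Hamilton Street',
    'geometry': {'type': 'Point', 'coordinates': [-73.9955835, 40.7110962]}
}]
print(len(geocode_entries(entries, addresses)))  # 1
```

Entries whose normalized address has no counterpart in the historical dataset simply stay ungeocoded, which is why only 967,482 of the 3,133,624 entries end up on the map.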
- 861 streets
- 74,635 house numbers
- 48,754 addresses
- 3,133,624 city directory entries (28 volumes)
- 967,482 geocoded entries
967,482 geocoded city directory entries
Occupations, counted
Names, counted
What's next?
Making all this data searchable and visible!
A new UI to search & visualize all NYC Space/Time Directory data
Who wants a beer?