Oceanicsdotio

PDF parser part 1

October 18, 2020


Concept

Oh gal, a doozy. Given an arbitrary multi-page PDF, extract an “image”, then extract all quantitative and categorical information, parse and save it in an interoperable format, and then generate an accessible interface (audio or braille).

I say “image” because sometimes the whole document is an image, and other times the images are like SVG and are made up of individual elements.

The original implementation was written for specific documents and features. I’ll focus on more general applications here, and use example documents that I have rights to.

Methods

We need to read in a PDF file and extract a specific image.

Parsing an image might extract:

  1. coordinates of data points, and their relation to the axes
  2. the mapping of visual features like lines and points to the figure legend
  3. text labels of axes and visual features
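
As a rough illustration of the first item: once two reference ticks on an axis are known, relating a point to that axis comes down to a linear mapping. This is only a sketch that assumes linear (not log or categorical) axes. The name pixelToData and the shape of the axis objects are invented for this example and are not part of any library.

```javascript
// Hypothetical helper: convert a pixel position to a data value,
// given two calibration ticks on a linear axis.
// axis.pixel = [p0, p1] are the pixel positions of two ticks,
// axis.value = [v0, v1] are the data values printed at those ticks.
const pixelToData = (px, axis) => {
    const [p0, p1] = axis.pixel;
    const [v0, v1] = axis.value;
    if (p0 === p1) throw new Error("Calibration ticks must be distinct");
    return v0 + (px - p0) * (v1 - v0) / (p1 - p0);
};

// Made-up calibration. Canvas y grows downward, so the larger data
// value sits at the smaller pixel position on the y axis.
const xAxis = {pixel: [100, 500], value: [0, 10]};
const yAxis = {pixel: [400, 50], value: [0, 70]};

// A marker found at pixel (300, 225) maps to data coordinates (5, 35)
const point = {
    x: pixelToData(300, xAxis),
    y: pixelToData(225, yAxis)
};
```

The hard part, of course, is finding the ticks and markers in the first place.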

Dependencies

We’ll use pdfjs-dist (from Mozilla), pluralize (because although it seems simple, English pluralization is not), and eventually the emscripten port of the Tesseract optical character recognition library.

The first, pdfjs-dist, is disgracefully under-documented, but there are enough community posts and projects to learn the basics without reading through the whole API code. Woot.

The last, Tesseract, is necessary for two reasons. First, images contain rendered text, and we will have to extract this. Second, older PDFs are Just Images, and have to be processed the same way as images embedded in modern documents.

Recent information is more accessible because it is likely to interoperate with modern systems. Archaic formats and physical copies are sometimes the only remaining record containing historical data. This is how archaeologists make their living.

Display a page

The most basic use of a PDF library is to view and browse a document. Sometimes you may just need to reference a single page, or a figure from a single page.

To view a PDF page, we probably want something like:

<PDF doc={`/johnson-etal-2019-sesf.pdf`} scale={1} pageNumber={2}/>

The scale property adjusts the resolution of the rendered document relative to the original. Most PDFs contain text data that are rendered by whatever viewer you happen to be using, so scale=1 will look the same as you might expect. Using scale=0.1 instead will produce a blurred image, and scale=10 will produce a very high resolution canvas version.
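
To put numbers on that: PDF page sizes are expressed in points (1/72 inch), and the rendered canvas is roughly that size multiplied by scale. The sketch below does the arithmetic by hand for a US Letter page (612 by 792 points) and ignores page rotation. The name canvasSize is invented here; in the component the equivalent numbers come from the library's viewport.

```javascript
// Hypothetical helper: canvas dimensions in pixels for a page measured
// in PDF points, at a given scale. Assumes the page is not rotated.
// Canvas dimensions are whole numbers, so round down.
const canvasSize = ({width, height}, scale) => ({
    width: Math.floor(width * scale),
    height: Math.floor(height * scale)
});

const letter = {width: 612, height: 792};  // US Letter, in points

// One entry per scale: 61x79, 612x792, and 6120x7920 pixels
const sizes = [0.1, 1, 10].map(scale => canvasSize(letter, scale));
```

At scale=10 that is about 48 million pixels for a single page, so very large values are best kept for cases where you actually need the detail.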

This works, and renders to:

Pretty convenient. I supply the document as a static file, because I am one of the authors. Otherwise, you might not host the document yourself, but instead supply a link to an open access copy.

Remember copyright and all that, and protect yourself with good security and obfuscation if you are going to push the limits of Fair Use.

If you take a look at the PDF.jsx component, we’re going to be using a fair number of library features:

// PDF.jsx
import React, {useEffect, useState, useRef} from 'react';
import {getDocument, version, GlobalWorkerOptions, OPS} from 'pdfjs-dist';
import {singular} from "pluralize";

Hooks, love React hooks, but seek elsewhere for explanation.

The PDF library is asynchronous; you’ll see in a sec what I mean. From pdfjs we use the getDocument() function. All other functionality is going to happen through method calls to the document and page object APIs.

The PDF library also uses web workers! Fancy. This means that processing of the document happens in the background without blocking interaction in the browser.

Mozilla and Cloudflare take care of the implementation details for us:

// PDF.jsx
GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${version}/pdf.worker.js`;

As we extract objects, we’ll save them so they can be reused. We’ll decode the document, cache, get a page, cache, get content and images… etc.

The more decoupled it is the better, because certain steps might fail for particular documents. Since we don’t control the original source material, fail gracefully and do as much as possible.
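
Outside of React, the same idea can be sketched as a list of named steps, where each result is cached and a failure stops the chain without throwing away earlier work. Everything here is hypothetical: runSteps is an invented name, and the three steps are stand-ins that only mimic decoding a document, getting a page, and failing to find images.

```javascript
// Hypothetical runner: apply named async steps in order, cache each
// result, and stop at the first failure while keeping what succeeded.
const runSteps = async (steps, input) => {
    const cache = {};
    let value = input;
    for (const [name, step] of steps) {
        try {
            value = await step(value);
            cache[name] = value;
        } catch (error) {
            return {cache, failed: name, error};
        }
    }
    return {cache, failed: null, error: null};
};

// Stand-in steps; real ones would call into the PDF library
const exampleSteps = [
    ["document", async (url) => ({url, numPages: 3})],
    ["page", async (doc) => ({doc, pageNumber: 2})],
    ["images", async () => { throw new Error("no embedded images"); }]
];

// Resolves with the document and page cached, and failed === "images"
const outcome = runSteps(exampleSteps, "/example.pdf");
```

The hooks in the component get the same effect by different means: each effect waits on the state set by the previous one, so a step that never succeeds simply leaves the later state variables null.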

Our document state and loader hook (lives inside the component):

// PDF.jsx
const [pdf, setPdf] = useState(null);
useEffect(()=>{
    // Save source document to React state
    getDocument(doc).promise.then((pdfData) => {setPdf(pdfData)});
},[doc]);

The async part is the .promise.then(). Because we are using React state setters as callbacks, we don’t actually have to use async/await syntax in the hook. We just make other hooks dependent on state variables having been previously set.
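
One thing to watch with this pattern is that the promise may resolve after the component has unmounted or its props have changed. A common guard, sketched below, is to wrap the callback so it can be switched off. The helper name whenResolved is invented for this example; it is not part of React or pdfjs.

```javascript
// Hypothetical helper: call `callback` when `promise` resolves, unless
// the returned cancel function has been called first.
const whenResolved = (promise, callback, onError = console.error) => {
    let cancelled = false;
    promise.then(
        value => { if (!cancelled) callback(value); },
        error => { if (!cancelled) onError(error); }
    );
    return () => { cancelled = true; };
};

// Inside a component, the cancel function can double as effect cleanup:
// useEffect(() => whenResolved(getDocument(doc).promise, setPdf), [doc]);
```

Returning the cancel function from the effect means React calls it on cleanup, so a late result is ignored instead of being written to state.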

For instance, a step to extract the requested page:

// PDF.jsx
const [page, setPage] = useState(null);
useEffect(()=>{
    // Save current page
    if (pdf) pdf.getPage(pageNumber).then(pg => {setPage(pg)});
},[pdf, pageNumber]);

Once we have the page data, we can render that page to a canvas element.

You could create the element programmatically, or utilize the useRef hook. We use getViewport() and render() methods to size and draw the page data using the pdfjs API:

// PDF.jsx
const ref = useRef(null);
useEffect(()=>{
    // Render full page to primary HTML canvas element
    if (!page || !ref.current) return;
    const viewport = page.getViewport({ scale });
    const {width, height} = viewport;
    ref.current.height = height;
    ref.current.width = width;

    page.render({viewport, canvasContext: ref.current.getContext('2d')});
},[page]);

The ref handle is supplied as a property to the canvas component, and React sets ref.current to the rendered element (<StyledCanvas ref={ref} hidden={false}/>).

If you want to keep the rendered page around in the component without displaying it, instead of using a local storage option, you can use hidden=true.

Extract text content

We can extend this process by also extracting the text.

PDF viewer applications normally draw the content as an image, and then render a transparent placeholder layer over it that allows for text highlighting and search.

Our effect hook to get text content for the page is:

const [pageContent, setPageContent] = useState(null);
useEffect(()=>{
    if (page) page.getTextContent().then(pgc => {setPageContent(pgc)});
},[page]);

If you wanna do the same for all pages, you can map and reduce the arrays.
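
A minimal sketch of that, assuming the loaded document exposes numPages and getPage() and each page exposes getTextContent(), as used elsewhere in this post. The function names getAllTextContent and allStrings are invented for the example.

```javascript
// Sketch: collect the text content of every page in a loaded document.
// Pages are numbered from 1.
const getAllTextContent = async (pdf) => {
    const pageNumbers = Array.from({length: pdf.numPages}, (_, i) => i + 1);
    const pages = await Promise.all(pageNumbers.map(n => pdf.getPage(n)));
    return Promise.all(pages.map(page => page.getTextContent()));
};

// Reduce the per-page results to one flat array of strings
const allStrings = (contents) =>
    contents.reduce(
        (acc, {items}) => acc.concat(items.map(item => item.str)),
        []
    );
```

Promise.all requests every page at once, which is fine for a short paper. For a very long document you might prefer to go page by page.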

The resulting object contains an items object array and a styles object. The items objects have values that index into styles.
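
For orientation, here is an abbreviated, made-up example of that shape, along with one way to resolve each item against its style. As far as I can tell the link between the two is the fontName field; the values below are invented, and real objects carry more fields than shown.

```javascript
// Abbreviated example of a text content object (invented values)
const exampleContent = {
    items: [
        {str: "Marine aquaculture", fontName: "g_d0_f1", transform: [12, 0, 0, 12, 72, 700]},
        {str: "is expanding", fontName: "g_d0_f2", transform: [10, 0, 0, 10, 72, 680]}
    ],
    styles: {
        g_d0_f1: {fontFamily: "serif", ascent: 0.9, descent: -0.2},
        g_d0_f2: {fontFamily: "sans-serif", ascent: 0.9, descent: -0.2}
    }
};

// Look up the style for each item and merge it with the item text
const withStyles = ({items, styles}) =>
    items.map(item => ({text: item.str, ...styles[item.fontName]}));

const resolved = withStyles(exampleContent);
```

The transform array positions the string on the page, which will matter later when relating labels to figure elements.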

What’s it about?

A common application is indexing. You might want to find a page of a document that contains a word or phrase.

Maybe order of appearance matters, or maybe not? Let’s not worry about it now, and assume that we can enhance the data structure below to include the line number and position of occurrence.
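
One possible shape for that enhancement, as a sketch only: keep a list of positions for each word instead of a bare count. Here the position is just the index of the text item, which usually follows reading order but is not guaranteed to. The name positionalIndex is invented for the example.

```javascript
// Hypothetical extension: record where each word occurs, not just how
// often. A Map avoids clashes with built-in object property names.
const positionalIndex = (items) => {
    const index = new Map();
    items.forEach((item, position) => {
        item.str
            .toLowerCase()
            .split(/\s+/)
            .map(word => word.replace(/[^a-z\-']/g, ""))
            .filter(Boolean)
            .forEach(word => {
                if (!index.has(word)) index.set(word, []);
                index.get(word).push(position);
            });
    });
    return index;
};

const exampleIndex = positionalIndex([
    {str: "Marine aquaculture systems"},
    {str: "The marine system"}
]);
// "marine" appears in items 0 and 1, "aquaculture" only in item 0
```

The count is then the length of each list, so nothing is lost relative to a plain frequency table.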

Here goes nothing:

const [lexicon, setLexicon] = useState(null);
useEffect(()=>{

    if (!pageContent) return;

    const regExp = /[^a-z\-\']/gmi;
    let vocabularySize = 0;

    (async () => {

        const stopWords = new Set(await fetch('/stopwords.json')
            .then(r => r.json()));

        setLexicon(Object.entries(
            pageContent.items
            .reduce((a, b) => a + b.str, "")
            .split(/\s+/g)
            .map(word => word.toLowerCase())
            .map(word => word.replace(regExp, ""))
            .map(word => singular(word))
            .filter(word => word && !stopWords.has(word))
            .reduce((a, word) => {   
                vocabularySize += 1;    
                if (word in a) a[word] += 1;
                else a[word] = 1;
                return a;
            }, {})
        ).sort(
            (a, b) => b[1] - a[1]  // comparator must return a number; most frequent first
        ));
    })();
},[pageContent]);

Stop words are the words that are ignored in indexing. I pulled these from a list online.

Then we take the strings and concatenate them, because sometimes words are split across lines. We split on whitespace, tokenize the words, filter out stop words, count, sort, and save ‘em to React state.
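
The same steps can be tried outside React on a few strings. This sketch leaves out pluralize and fetch, and uses a tiny made-up stop list; countWords is a name invented here. It also shows why the strings are joined without a separator: a word split across two items is stitched back together, but only if the items carry their own whitespace everywhere else.

```javascript
// Standalone sketch of the counting steps, without singularization
const countWords = (strings, stopWords) => {
    const counts = new Map();
    strings
        .join("")  // rejoin words that were split across items
        .split(/\s+/)
        .map(word => word.toLowerCase().replace(/[^a-z\-']/g, ""))
        .filter(word => word && !stopWords.has(word))
        .forEach(word => counts.set(word, (counts.get(word) || 0) + 1));
    // Sort entries by count, most frequent first
    return [...counts.entries()].sort((a, b) => b[1] - a[1]);
};

// "aqua" + "culture" are rejoined into a single token
const ranked = countWords(
    ["Marine aqua", "culture is a system. ", "The system supports aquaculture."],
    new Set(["is", "a", "the"])
);
```

With this input, aquaculture and system each count twice, and marine and supports once.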

Our group could really be better about using a greater variety of language. This one page has 309 meaningful words (of 475 total), i.e. words that do not appear on the stop list. 189 of these words are unique, when corrected for pluralization.

To be honest, the paper is kind of a drag to read because of this. Let’s take the most frequent words: aquaculture (14), system (12), framework (11), marine (8), social (7), and research (7).

Next come a few more, including some acronyms: sesf (6) (socioecological system framework), ecological (5), se (5) (socioecological), sustainability (4), and sess (4) (socioecological systems).

That’s 83 of 309 meaningful words used to basically say “A Social-Ecological System Framework for Marine Aquaculture Research”, which is the title.

I hope you can see how this might be helpful in improving your own writing.

:-)

Next time

Next time we do the image stuff. I have to find a good paper to demonstrate with, maybe my thesis?

Johnson T, Beard K, Brady D, Byron C, Cleaver C, Duffy K, Keeney N, Kimble M, Miller M, Moeykens S, Teisl M, van Walsum G, Yuan J. 2019. A Social-Ecological System Framework for Marine Aquaculture Research. Sustainability 11(9):2522.