How to extract data from PDF operator parameters? What a PDF file needs in order to be displayed is “characters as pictures”, not “characters that constitute text data”. Text data is not necessary for displaying a PDF file, and this is what makes extracting text data from PDF files so hard. The purpose of this article is to provide some help for those who want to extract textual information from PDF files and learn more about how they work internally.
Steps to extract PDF file data
Parse the content stream
[merge pdf tool of AbcdPDF] First, the tool has its online server parse the binary data structure of the PDF file to locate what is called the “content stream”. A content stream should not be confused with “text data”: in the PDF specification, “text” simply means the characters displayed on the page (that is, the sequence of “characters as pictures”). The basic strategy from here on is to read the text placed on the page from the content stream and interpret it as textual data. Note that content streams in PDF files are usually compressed.
Decompressing a content stream with the appropriate algorithm (the one named by the stream's /Filter entry) yields data in plain text format. In the following, this data in plain text format is also referred to as “content stream”.
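As a minimal sketch of the decompression step, assuming the stream uses the common /FlateDecode filter (zlib/deflate) with no predictor, Python's standard zlib module is enough. The function name inflate_stream is hypothetical, and other filters such as /LZWDecode or /ASCII85Decode would need different handling.

```python
import zlib

def inflate_stream(raw: bytes) -> bytes:
    """Decompress the raw bytes of a stream whose /Filter is /FlateDecode.

    Assumption: no /DecodeParms predictor is involved.
    """
    return zlib.decompress(raw)

# Round trip with a tiny hand-written content stream standing in for real data.
sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(sample)
print(inflate_stream(compressed).decode("ascii"))
```

In a real file, the bytes passed to inflate_stream would be those between the stream and endstream keywords of the page's content stream object.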
Read the content stream
Content streams consist of commands called “PDF operators” and their parameters. The parameters are written first and the operator that consumes them follows, so extracting the necessary information from a content stream correctly requires writing a parser and implementing a mechanism equivalent to a stack machine.
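The idea of a parser combined with a stack machine can be sketched as follows. This is a simplified illustration, not a complete PDF tokenizer: it assumes a decompressed content stream without comments, dictionaries, inline images, or nested parentheses inside literal strings. The names TOKEN and run_stream are hypothetical.

```python
import re

TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"        # literal string (nested parentheses not handled)
    rb"|<[0-9A-Fa-f\s]*>"           # hex string
    rb"|[\[\]]"                     # array delimiters
    rb"|/[^\s/\[\]()<>]*"           # name object such as /F1
    rb"|[+-]?(?:\d+\.?\d*|\.\d+)"   # integer or real number
    rb"|[A-Za-z*'\x22]+",           # operator such as Tf, Tj, TJ, T*, ' or "
    re.DOTALL,
)

def run_stream(content: bytes):
    """Yield (operator, operands) pairs in the order they occur in the stream."""
    stack, open_arrays = [], []
    for match in TOKEN.finditer(content):
        tok = match.group()
        if tok == b"[":
            # Start collecting array elements on a fresh stack.
            open_arrays.append(stack)
            stack = []
        elif tok == b"]":
            # Close the array and push it as one operand of the outer stack.
            array, stack = stack, open_arrays.pop()
            stack.append(array)
        elif tok[:1] in b"(</" or tok[:1] in b"+-.0123456789":
            stack.append(tok)        # operand: string, name, or number
        else:
            yield tok.decode("ascii"), stack   # operator consumes the operands
            stack = []

for op, operands in run_stream(b"BT /F1 12 Tf [(He) -20 (llo)] TJ ET"):
    print(op, operands)
```

Operands are kept as raw byte tokens here, so that later steps can decide how to decode strings according to the current font.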
Get the text data from the parameters of the text drawing operator
If you view the plain-text content stream in an editor, the arguments to the TJ and Tj operators look like they might be text data. However, even if an argument is read exactly as written, it cannot be used as text data.
There are three main reasons:
1. The format and encoding used to store the parameters depend on the implementation of the PDF generation tool and on the font type.
2. What the parameters directly tell you is how to look up, in a given font, the information needed to draw characters as pictures; they are not necessarily text data.
3. The order of the text data cannot be determined from the position of the TJ/Tj operators in the content stream alone.
The first problem is how to read the parameters of the TJ/Tj operators. By design, the arguments to the PDF operators used to draw text can be either “literal strings” or “hex strings”, which have completely different formats. In addition, the encoding of these strings depends on the font.
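The difference between the two string formats can be illustrated with a short sketch. It assumes each token has already been isolated, including its delimiters, and it only produces raw bytes; how those bytes map to characters still depends on the font. The function names decode_literal and decode_hex are hypothetical.

```python
import re

# Single-character escapes allowed after a backslash in a literal string.
_ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
            b"(": b"(", b")": b")", b"\\": b"\\"}

def decode_literal(token: bytes) -> bytes:
    """Turn a literal string token such as b'(Hello)' into raw bytes."""
    def replace(match):
        esc = match.group(1)
        if esc[:1] in b"01234567":
            return bytes([int(esc, 8) & 0xFF])   # octal escape such as \101
        if esc in (b"\n", b"\r", b"\r\n"):
            return b""                           # backslash + newline: line continuation
        return _ESCAPES.get(esc, esc)            # unknown escape: drop the backslash
    return re.sub(rb"\\([0-7]{1,3}|\r\n|.)", replace, token[1:-1], flags=re.DOTALL)

def decode_hex(token: bytes) -> bytes:
    """Turn a hex string token such as b'<48656C6C6F>' into raw bytes."""
    digits = b"".join(token[1:-1].split())       # whitespace between digits is ignored
    if len(digits) % 2:
        digits += b"0"                           # a missing final digit counts as 0
    return bytes.fromhex(digits.decode("ascii"))

print(decode_literal(rb"(Hello \(PDF\)\101)"))   # b'Hello (PDF)A'
print(decode_hex(b"<48656C6C6F>"))               # b'Hello'
```

Both functions return bytes rather than text on purpose: the same byte sequence can stand for different characters under different fonts.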
The second problem is that the parameters read this way are usually not text data themselves. Especially with Japanese fonts, in many cases the parameter is nothing more than an identifier used to find the character within the font.
To get text data, you must find the Unicode character that corresponds to each identifier by referencing information elsewhere, inside or outside the PDF file. The mapping table is usually contained in the PDF file as a “/ToUnicode CMap”, and this information is used to convert identifiers into Unicode characters.
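A rough sketch of how such a mapping table might be read and applied is shown below. It assumes the /ToUnicode CMap has already been decompressed to text, handles only bfchar entries and the simple (non-array) form of bfrange whose destination is a single code unit, and assumes fixed two-byte character codes. The names parse_tounicode and to_text and the sample CMap are made up for illustration.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap: str) -> dict:
    """Build a {character code: Unicode string} table from a ToUnicode CMap."""
    table = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            # The destination is UTF-16BE encoded, written as hex digits.
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)   # assumption: destination is one code unit
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                table[code] = chr(start + offset)
    return table

def to_text(raw: bytes, table: dict, code_bytes: int = 2) -> str:
    """Map fixed-width character codes to text; unknown codes become U+FFFD."""
    codes = (int.from_bytes(raw[i:i + code_bytes], "big")
             for i in range(0, len(raw), code_bytes))
    return "".join(table.get(code, "\ufffd") for code in codes)

sample_cmap = """
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<0024> <0026> <0041>
endbfrange
"""
table = parse_tounicode(sample_cmap)
print(to_text(bytes.fromhex("0024000300250026"), table))   # A BC
```

The bytes passed to to_text would be the decoded operand of a Tj or TJ operator, and the table must be the one belonging to the font selected by the preceding Tf operator.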
The third problem is that when we extract text data from a PDF file, we expect it to follow “the order in which a human would read the PDF file when displayed”, but the text drawing operators are not guaranteed to appear in that order within the content stream. This means that there is no guarantee of usable text unless it can be determined whether pieces of text that are adjacent in the content stream should also be adjacent in the output text data, or whether they constitute separate words with enough space or a newline between them.
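One possible heuristic for restoring reading order is sketched below. It is only an illustration under strong assumptions: each text fragment's position and width are taken as already computed from the text matrix and font metrics, the page is a single column read left to right and top to bottom, and the thresholds line_tol and space_gap are arbitrary values. The Fragment class and join_fragments function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    x: float      # left edge, in page units
    y: float      # baseline; larger values are higher on the page
    width: float  # horizontal extent of the drawn text
    text: str

def join_fragments(fragments, line_tol=2.0, space_gap=1.5):
    """Group fragments into lines by baseline, then join them left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)       # same baseline: same line
        else:
            lines.append([frag])         # new line further down the page
    out = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A visible gap is treated as a word break; touching fragments are merged.
            text += (" " if gap > space_gap else "") + cur.text
        out.append(text)
    return "\n".join(out)

# Fragments listed in "content stream order", which differs from reading order.
frags = [
    Fragment(65.0, 700.0, 30.0, "World"),
    Fragment(30.0, 680.0, 24.0, "Next"),
    Fragment(95.2, 700.0, 4.0, "!"),
    Fragment(30.0, 700.0, 30.0, "Hello"),
]
print(join_fragments(frags))
```

Multi-column layouts, vertical writing, and rotated text would all defeat this simple rule, which is why reading order remains the least reliable part of text extraction.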
How to extract data from PDF operator parameters? This article has taken three online tools, convert pdf to jpg, convert jpg to pdf, and merge pdf, as examples to explain the methods and steps for extracting data from PDF operator parameters.