public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
Added in Tika 1.24 as an alpha version of a text extractor that builds its text from the marked content tree and includes (and normalizes) some of the structural tags.
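To make "normalizes some of the structural tags" concrete, here is a minimal sketch of what such a normalization step could look like. The class name `StructureTagSketch`, the mapping table, and the `div` fallback are all assumptions made for illustration; they are not taken from Tika's implementation, which may map tags differently.

```java
import java.util.Locale;
import java.util.Map;

public class StructureTagSketch {

    // Hypothetical mapping from PDF structure types to XHTML element names.
    // Not Tika's actual table; chosen only to illustrate the idea.
    private static final Map<String, String> TAG_MAP = Map.of(
            "P", "p",
            "H1", "h1",
            "H2", "h2",
            "L", "ul",
            "LI", "li",
            "TABLE", "table",
            "TR", "tr",
            "TD", "td");

    // Returns the XHTML element name for a structure type,
    // falling back to a generic "div" for anything unknown (an assumption).
    static String normalize(String structureType) {
        if (structureType == null) {
            return "div";
        }
        String key = structureType.trim().toUpperCase(Locale.ROOT);
        return TAG_MAP.getOrDefault(key, "div");
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"P", "h1", "LI", "Figure"}) {
            System.out.println(tag + " -> " + normalize(tag));
        }
    }
}
```

Lookup is case-insensitive here so that variants such as `h1` and `H1` land on the same element; whether the real extractor does this is not stated in this page.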
| Modifier and Type | Field and Description |
|---|---|
| static java.lang.String | XMP_DOCUMENT_CATALOG_LOCATION |
| static java.lang.String | XMP_PAGE_LOCATION_PREFIX |

| Modifier and Type | Method and Description |
|---|---|
| int | getCurrentPageNo() We need to override this because we are overriding processPages(PDPageTree). |
| int | getStartPage() |
| static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler. |
| void | processPage(org.apache.pdfbox.pdmodel.PDPage page) |
| void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
| void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
| void | setStartPage(int startPage) |
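
Because `process(...)` delivers its output as XHTML SAX events, the caller supplies an `org.xml.sax.ContentHandler`. The following is a minimal sketch of such a handler, using only JDK classes. The class name `ParagraphCollector` and the choice to collect `<p>` text are illustrative assumptions, and the events in `demo()` are simulated by hand; in real use the same handler would be passed as the `handler` argument of `process(pdDocument, handler, context, metadata, config)`.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each <p> element it sees.
public class ParagraphCollector extends DefaultHandler {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName) && current != null) {
            paragraphs.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getParagraphs() {
        return paragraphs;
    }

    // Simulates a few SAX events by hand, standing in for a real extraction run.
    static ParagraphCollector demo() throws SAXException {
        ParagraphCollector handler = new ParagraphCollector();
        handler.startDocument();
        for (String text : new String[] {"First paragraph", " Second paragraph "}) {
            handler.startElement(XHTML, "p", "p", new AttributesImpl());
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endDocument();
        return handler;
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(demo().getParagraphs());
    }
}
```

A handler like this ignores every element other than `<p>`; a caller interested in headings or list items would match those local names in the same way.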
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper: getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText

Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine: addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint

public static final java.lang.String XMP_DOCUMENT_CATALOG_LOCATION

public static final java.lang.String XMP_PAGE_LOCATION_PREFIX

public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument,
org.xml.sax.ContentHandler handler,
ParseContext context,
Metadata metadata,
PDFParserConfig config)
throws org.xml.sax.SAXException,
TikaException
Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
Throws:
org.xml.sax.SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing

public void processPage(org.apache.pdfbox.pdmodel.PDPage page)
throws java.io.IOException
Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
Throws:
java.io.IOException

public int getCurrentPageNo()
We need to override this because we are overriding processPages(PDPageTree).
Overrides:
getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper

public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
Overrides:
setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper

public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
Overrides:
setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper

public void setStartPage(int startPage)
Overrides:
setStartPage in class org.apache.pdfbox.text.PDFTextStripper

public int getStartPage()
Overrides:
getStartPage in class org.apache.pdfbox.text.PDFTextStripper

Copyright © 2010 - 2023 Adobe. All Rights Reserved