public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
Added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked content tree and includes and normalizes some of the structural tags.
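To illustrate what normalizing structural tags could look like, the sketch below maps a few standard PDF structure types onto XHTML element names. The class name `StructureTagSketch`, the method `toXhtml`, the table contents and the `div` fallback are all assumptions made for illustration; the mapping this class actually applies is not documented here and may differ.

```java
import java.util.Map;

// Hypothetical sketch: this is NOT the mapping used by PDFMarkedContent2XHTML.
public class StructureTagSketch {

    // A few standard PDF structure types paired with assumed XHTML equivalents.
    private static final Map<String, String> TAGS = Map.of(
            "P", "p",
            "H1", "h1",
            "L", "ul",
            "LI", "li",
            "Table", "table",
            "TR", "tr",
            "TD", "td");

    // Returns the assumed XHTML element for a structure type, or "div" if unknown.
    public static String toXhtml(String structureType) {
        return TAGS.getOrDefault(structureType, "div");
    }

    public static void main(String[] args) {
        System.out.println(toXhtml("H1"));
        System.out.println(toXhtml("Figure"));
    }
}
```

Falling back to a generic container is one possible design choice; an extractor could just as well drop unknown tags and keep only their text.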
Modifier and Type | Field and Description |
---|---|
static java.lang.String | XMP_DOCUMENT_CATALOG_LOCATION |
static java.lang.String | XMP_PAGE_LOCATION_PREFIX |
Modifier and Type | Method and Description |
---|---|
int | getCurrentPageNo(): we need to override this because we are overriding processPages(PDPageTree) |
int | getStartPage() |
static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config): Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler. |
void | processPage(org.apache.pdfbox.pdmodel.PDPage page) |
void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
void | setStartPage(int startPage) |
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper:
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine:
addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
public static final java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
public static final java.lang.String XMP_PAGE_LOCATION_PREFIX
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws org.xml.sax.SAXException, TikaException
Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
Throws:
org.xml.sax.SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws java.io.IOException
Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
Throws:
java.io.IOException
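Because the static `process` method delivers its output as SAX events, the caller supplies an `org.xml.sax.ContentHandler` to receive them. The following is a minimal sketch of such a handler using only JDK classes. The class name `TextCollectingHandler` and its `demo` method are hypothetical, and the hand-written event sequence only stands in for the stream that `process` would produce; it does not reproduce the exact elements Tika emits.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

// Hypothetical handler: records element names and character data from SAX events.
public class TextCollectingHandler extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final StringBuilder text = new StringBuilder();
    private final List<String> elements = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add(localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    public List<String> getElements() {
        return elements;
    }

    // Feeds a tiny, made-up XHTML event stream into a new handler.
    // In real use the events would come from process(...), not from this method.
    public static TextCollectingHandler demo() {
        TextCollectingHandler h = new TextCollectingHandler();
        try {
            h.startDocument();
            h.startElement(XHTML, "p", "p", new AttributesImpl());
            char[] chars = "Hello PDF".toCharArray();
            h.characters(chars, 0, chars.length);
            h.endElement(XHTML, "p", "p");
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return h;
    }

    public static void main(String[] args) {
        TextCollectingHandler h = demo();
        System.out.println(h.getElements() + " " + h.getText());
    }
}
```

An instance of a handler like this would be passed as the `handler` argument of `process`, alongside the document, parse context, metadata and parser configuration.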
public int getCurrentPageNo()
We need to override this because we are overriding processPages(PDPageTree).
Overrides:
getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
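The reason for overriding `getCurrentPageNo()` can be sketched with stand-in classes. The example assumes the base class keeps its page counter private and advances it only inside its own page loop, so a subclass that replaces that loop has to maintain and expose its own counter. `BaseStripper`, `MarkedContentStripper` and `lastPageSeen` are hypothetical names, and pages are modeled as plain strings rather than PDFBox types.

```java
import java.util.List;

public class PageCounterSketch {

    // Hypothetical stand-in for a base stripper that tracks the page number privately.
    public static class BaseStripper {
        private int currentPageNo = 0;

        public void processPages(List<String> pages) {
            for (String page : pages) {
                currentPageNo++;
                processPage(page);
            }
        }

        public void processPage(String page) {
            // Text extraction for a single page would happen here.
        }

        public int getCurrentPageNo() {
            return currentPageNo;
        }
    }

    // Hypothetical subclass that replaces the page loop. The base counter is never
    // advanced, so the getter must be overridden to report the subclass counter.
    public static class MarkedContentStripper extends BaseStripper {
        private int pageNo = 0;

        @Override
        public void processPages(List<String> pages) {
            for (String page : pages) {
                pageNo++;
                processPage(page);
            }
        }

        @Override
        public int getCurrentPageNo() {
            return pageNo;
        }
    }

    // Runs the page loop and reports the page number the stripper ended on.
    public static int lastPageSeen(BaseStripper stripper, List<String> pages) {
        stripper.processPages(pages);
        return stripper.getCurrentPageNo();
    }

    public static void main(String[] args) {
        System.out.println(lastPageSeen(new MarkedContentStripper(), List.of("a", "b", "c")));
    }
}
```

Without the overridden getter, the subclass in this sketch would inherit a counter that stays at zero, since only the replaced loop ever incremented it.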
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
Overrides:
setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
Overrides:
setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
public void setStartPage(int startPage)
Overrides:
setStartPage in class org.apache.pdfbox.text.PDFTextStripper
public int getStartPage()
Overrides:
getStartPage in class org.apache.pdfbox.text.PDFTextStripper
Copyright © 2010 - 2023 Adobe. All Rights Reserved