SILVERCODERS DocToText
4.0.1512
Converts DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML and HTML documents to plain text. Extracts metadata and annotations.
#include <plain_text_extractor.h>
Public Types | |
enum | ParserType { PARSER_AUTO, PARSER_RTF, PARSER_ODF_OOXML, PARSER_XLS, PARSER_DOC, PARSER_PPT, PARSER_HTML, PARSER_IWORK, PARSER_XLSB, PARSER_PDF, PARSER_TXT, PARSER_EML, PARSER_ODFXML } |
Public Member Functions | |
PlainTextExtractor (ParserType parser_type=PARSER_AUTO) | |
void | setVerboseLogging (bool verbose) |
void | setLogStream (std::ostream &log_stream) |
void | setFormattingStyle (const FormattingStyle &style) |
void | setXmlParseMode (XmlParseMode mode) |
void | setManageXmlParser (bool manage) |
ParserType | parserTypeByFileExtension (const std::string &file_name) |
ParserType | parserTypeByFileExtension (const char *file_name) |
bool | parserTypeByFileContent (const std::string &file_name, ParserType &parser_type) |
bool | parserTypeByFileContent (const char *file_name, ParserType &parser_type) |
bool | parserTypeByFileContent (const char *buffer, size_t size, ParserType &parser_type) |
bool | processFile (const std::string &file_name, std::string &text) |
bool | processFile (const char *file_name, char *&text) |
bool | processFile (const char *buffer, size_t size, char *&text) |
bool | processFile (const char *buffer, size_t size, std::string &text) |
bool | processFile (ParserType parser_type, bool fallback, const std::string &file_name, std::string &text) |
bool | processFile (ParserType parser_type, bool fallback, const char *file_name, char *&text) |
bool | processFile (ParserType parser_type, bool fallback, const char *buffer, size_t size, char *&text) |
bool | processFile (ParserType parser_type, bool fallback, const char *buffer, size_t size, std::string &text) |
bool | extractMetadata (const std::string &file_name, Metadata &metadata) |
bool | extractMetadata (const char *file_name, Metadata &metadata) |
bool | extractMetadata (const char *buffer, size_t size, Metadata &metadata) |
bool | extractMetadata (ParserType parser_type, bool fallback, const std::string &file_name, Metadata &metadata) |
bool | extractMetadata (ParserType parser_type, bool fallback, const char *file_name, Metadata &metadata) |
bool | extractMetadata (ParserType parser_type, bool fallback, const char *buffer, size_t size, Metadata &metadata) |
size_t | getNumberOfLinks () const |
void | getParsedLinks (std::vector< Link > &links) const |
void | getParsedLinks (const Link *&links, size_t &number_of_links) const |
const Link * | getParsedLinks () const |
void | getAttachments (std::vector< Attachment > &attachments) const |
void | getAttachments (const Attachment *&attachments, size_t &number_of_attachments) const |
const Attachment * | getAttachments () const |
size_t | getNumberOfAttachments () const |
Extracts plain text from documents. In addition, it can be used to extract metadata and comments (annotations). Example of usage (extracting plain text):
Example of usage (extracting metadata):
Note that each instance of PlainTextExtractor should be used from a single thread only: one instance cannot parse two or more files in parallel.
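The call pattern for plain-text extraction can be sketched as follows. This is a minimal sketch, not the library's own example: extract_to_stream and StandInExtractor are hypothetical names introduced here, and the stand-in only exists so the code compiles without the library. The template accepts any type that exposes the documented processFile(const std::string &, std::string &) overload.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Works with any type exposing the documented overload
//   bool processFile(const std::string &file_name, std::string &text);
// With the library installed, a doctotext::PlainTextExtractor can be passed.
template <class Extractor>
bool extract_to_stream(Extractor& extractor, const std::string& file_name,
                       std::ostream& out) {
    std::string text;
    if (!extractor.processFile(file_name, text)) {
        return false;  // failure is reported through the return value
    }
    out << text;
    return true;
}

// Hypothetical stand-in so the sketch compiles without the library.
// It is NOT the library's logic: it "extracts" a fixed string.
struct StandInExtractor {
    bool processFile(const std::string& file_name, std::string& text) {
        if (file_name.empty()) return false;
        text = "text of " + file_name;
        return true;
    }
};
```

Because of the threading note above, a multi-threaded caller would create one extractor per thread rather than sharing a single instance.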
Enumerates all supported document formats. PARSER_AUTO means that the format is unknown and should be determined automatically.
doctotext::PlainTextExtractor::PlainTextExtractor | ( | ParserType | parser_type = PARSER_AUTO | ) |
The constructor.
parser_type | restricts the parser to the specified document format. If set to PARSER_AUTO, the parser works with all supported document formats. |
bool doctotext::PlainTextExtractor::extractMetadata | ( | const std::string & | file_name, |
Metadata & | metadata | ||
) |
Parses specified document and extracts metadata (author, creation time, etc).
file_name | full path to file containing document. |
metadata | reference to object of Metadata class that will contain extracted information. |
Returns true if the document was processed successfully, false otherwise.
bool doctotext::PlainTextExtractor::extractMetadata | ( | const char * | file_name, |
Metadata & | metadata | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
bool doctotext::PlainTextExtractor::extractMetadata | ( | const char * | buffer, |
size_t | size, | ||
Metadata & | metadata | ||
) |
Parses specified document and extracts metadata (author, creation time, etc). Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
buffer | pointer to the file content array |
size | size of buffer |
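The buffer overloads need the whole document in memory. One way to obtain such a buffer is sketched below; read_file is a hypothetical helper introduced for illustration and is not part of the library.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads a whole file into memory in binary mode.
// Returns false if the file cannot be opened.
bool read_file(const std::string& path, std::vector<char>& buffer) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    buffer.assign(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>());
    return true;
}
```

After a successful call, buffer.data() and buffer.size() correspond to the buffer and size arguments described above. Binary mode matters because most supported formats are not plain text and may contain zero bytes.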
bool doctotext::PlainTextExtractor::extractMetadata | ( | ParserType | parser_type, |
bool | fallback, | ||
const std::string & | file_name, | ||
Metadata & | metadata | ||
) |
Parses specified document and extracts metadata (author, creation time, etc).
parser_type | restricts the parser to the specified document format. If set to PARSER_AUTO, the parser works with all supported document formats. This argument overrides the parser type set for the object. |
fallback | if true, the parser will try to detect the document format when parsing with the format specified in the parser_type argument fails. This parameter is ignored if parser_type is set to PARSER_AUTO. |
file_name | full path to file containing document. |
metadata | reference to object of Metadata class that will contain extracted information. |
Returns true if the document was processed successfully, false otherwise.
bool doctotext::PlainTextExtractor::extractMetadata | ( | ParserType | parser_type, |
bool | fallback, | ||
const char * | file_name, | ||
Metadata & | metadata | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
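The interaction between parser_type and fallback described above can be read as the control flow below. This is only an interpretation of the documented behaviour, not the library's code: parse_with_fallback and the parse and detect callbacks are hypothetical, and the enum is a reduced copy used for illustration.

```cpp
#include <cassert>
#include <functional>

// Reduced copy of the ParserType enum, for illustration only.
enum ParserType { PARSER_AUTO, PARSER_RTF, PARSER_HTML };

// parse(type) and detect(type&) are hypothetical callbacks standing in for
// the library's internals.
bool parse_with_fallback(ParserType parser_type, bool fallback,
                         const std::function<bool(ParserType)>& parse,
                         const std::function<bool(ParserType&)>& detect) {
    if (parser_type == PARSER_AUTO) {
        // fallback is ignored here: detection always runs
        ParserType detected = PARSER_AUTO;
        return detect(detected) && parse(detected);
    }
    if (parse(parser_type)) return true;  // the requested format worked
    if (!fallback) return false;          // no second attempt allowed
    ParserType detected = PARSER_AUTO;
    return detect(detected) && parse(detected);
}
```

With fallback set to false, a wrong parser_type fails immediately, which is useful when the caller wants to reject files whose content does not match the expected format.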
bool doctotext::PlainTextExtractor::extractMetadata | ( | ParserType | parser_type, |
bool | fallback, | ||
const char * | buffer, | ||
size_t | size, | ||
Metadata & | metadata | ||
) |
Parses specified document and extracts metadata (author, creation time, etc). Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
buffer | pointer to the file content array |
size | size of buffer |
void doctotext::PlainTextExtractor::getAttachments | ( | std::vector< Attachment > & | attachments | ) | const |
Gets a vector of the attachments in the last parsed file. Only the EML parser is supported for now. Call this method after a file has been processed.
void doctotext::PlainTextExtractor::getAttachments | ( | const Attachment *& | attachments, |
size_t & | number_of_attachments | ||
) | const |
Gets an array of the attachments in the last parsed file. Only the EML parser is supported for now. Call this method after a file has been processed. Note that the array of attachments is deleted when the PlainTextExtractor is deleted or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
attachments | receives a pointer to the first element in the array of Attachment objects. |
number_of_attachments | receives the number of attachments in the array. |
const Attachment* doctotext::PlainTextExtractor::getAttachments | ( | ) | const |
Gets an array of the attachments in the last parsed file. Only the EML parser is supported for now. Call this method after a file has been processed. Note that the array of attachments is deleted when the PlainTextExtractor is deleted or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Returns a pointer to the first element in the array of Attachment objects.
size_t doctotext::PlainTextExtractor::getNumberOfAttachments | ( | ) | const |
Gets the number of attachments in the last parsed file. Only the EML parser is supported for now. Call this method after a file has been processed.
size_t doctotext::PlainTextExtractor::getNumberOfLinks | ( | ) | const |
Gets number of links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML.
void doctotext::PlainTextExtractor::getParsedLinks | ( | std::vector< Link > & | links | ) | const |
Gets vector of the links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML.
void doctotext::PlainTextExtractor::getParsedLinks | ( | const Link *& | links, |
size_t & | number_of_links | ||
) | const |
Gets an array of the links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML. Note that the array of links is deleted when the PlainTextExtractor is deleted or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
links | receives a pointer to the first element in the array of Link objects. |
number_of_links | receives the number of links in the array. |
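Since the pointer returned through links stops being valid once another file is parsed, a caller who needs the data for longer has to copy it first. A minimal sketch of such a copy, using a hypothetical generic helper copy_table that is not part of the library:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies `count` elements starting at `first` into an owning vector, so the
// data stays valid after the source array has been released.
template <class T>
std::vector<T> copy_table(const T* first, std::size_t count) {
    assert(first != nullptr || count == 0);
    if (count == 0) return std::vector<T>();
    return std::vector<T>(first, first + count);
}
```

With the library, the pointer and count filled in by getParsedLinks(links, number_of_links) would be passed straight to the helper. The std::vector overload of getParsedLinks avoids the problem altogether, so this is mainly relevant to code that already uses the pointer overloads.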
const Link* doctotext::PlainTextExtractor::getParsedLinks | ( | ) | const |
Gets an array of the links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML. Note that the array of links is deleted when the PlainTextExtractor is deleted or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Returns a pointer to the first element in the array of Link objects.
bool doctotext::PlainTextExtractor::parserTypeByFileContent | ( | const std::string & | file_name, |
ParserType & | parser_type | ||
) |
Tries to determine document format by file content.
file_name | full path to file containing document. |
parser_type | reference to a variable of type ParserType that will contain the determined document format, or PARSER_AUTO if the document format cannot be determined. |
Returns true if the document was processed successfully, false otherwise.
bool doctotext::PlainTextExtractor::parserTypeByFileContent | ( | const char * | file_name, |
ParserType & | parser_type | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
bool doctotext::PlainTextExtractor::parserTypeByFileContent | ( | const char * | buffer, |
size_t | size, | ||
ParserType & | parser_type | ||
) |
Tries to determine document format by file content. Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
buffer | pointer to the file content array |
size | size of buffer |
ParserType doctotext::PlainTextExtractor::parserTypeByFileExtension | ( | const std::string & | file_name | ) |
Tries to determine document format by file name extension.
file_name | file name or full path to file. |
Returns the ParserType value representing the determined document format, or PARSER_AUTO if the document format cannot be determined.
ParserType doctotext::PlainTextExtractor::parserTypeByFileExtension | ( | const char * | file_name | ) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
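Extension-based detection amounts to a lookup from the file name suffix to a ParserType value. The sketch below illustrates the idea only: guess_parser_by_extension is a hypothetical function, and the extension table is an assumption derived from the list of supported formats, not the library's actual mapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Copy of the ParserType enum listed under Public Types.
enum ParserType { PARSER_AUTO, PARSER_RTF, PARSER_ODF_OOXML, PARSER_XLS,
                  PARSER_DOC, PARSER_PPT, PARSER_HTML, PARSER_IWORK,
                  PARSER_XLSB, PARSER_PDF, PARSER_TXT, PARSER_EML,
                  PARSER_ODFXML };

// Assumed mapping; the real library may recognise more or fewer extensions.
ParserType guess_parser_by_extension(const std::string& file_name) {
    static const std::map<std::string, ParserType> table = {
        {"rtf", PARSER_RTF},        {"doc", PARSER_DOC},
        {"xls", PARSER_XLS},        {"xlsb", PARSER_XLSB},
        {"ppt", PARSER_PPT},        {"pdf", PARSER_PDF},
        {"eml", PARSER_EML},        {"txt", PARSER_TXT},
        {"html", PARSER_HTML},      {"htm", PARSER_HTML},
        {"odt", PARSER_ODF_OOXML},  {"ods", PARSER_ODF_OOXML},
        {"odp", PARSER_ODF_OOXML},  {"docx", PARSER_ODF_OOXML},
        {"xlsx", PARSER_ODF_OOXML}, {"pptx", PARSER_ODF_OOXML},
        {"fodt", PARSER_ODFXML},    {"fods", PARSER_ODFXML},
        {"fodp", PARSER_ODFXML},    {"pages", PARSER_IWORK},
        {"numbers", PARSER_IWORK},  {"key", PARSER_IWORK},
    };
    const std::string::size_type dot = file_name.find_last_of('.');
    if (dot == std::string::npos) return PARSER_AUTO;  // no extension at all
    std::string ext = file_name.substr(dot + 1);
    // Compare case-insensitively by lowering the extension first.
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    const auto it = table.find(ext);
    return it == table.end() ? PARSER_AUTO : it->second;
}
```

An extension is only a hint, which is why the class also offers parserTypeByFileContent for cases where the file name cannot be trusted.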
bool doctotext::PlainTextExtractor::processFile | ( | const std::string & | file_name, |
std::string & | text | ||
) |
Parses specified document and extracts plain text.
file_name | full path to file containing document. |
text | reference to object of std::string class that will contain produced plain text. |
Returns true if the document was processed successfully, false otherwise.
bool doctotext::PlainTextExtractor::processFile | ( | const char * | file_name, |
char *& | text | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
text | reference to a pointer that will point to the produced plain text in the form of a null-terminated array of chars. The caller is responsible for deleting the buffer with the delete[] operator. |
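Because the caller owns the returned buffer and must release it with delete[], it helps to hand the pointer to a std::unique_ptr<char[]> as soon as the call returns. The sketch below assumes nothing beyond that contract: produce_text is a hypothetical stand-in that allocates the way the documentation describes, and text_as_string is an illustrative wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical producer following the documented contract: on success it
// stores a new[]-allocated, null-terminated buffer in `text`.
bool produce_text(const char* source, char*& text) {
    if (source == nullptr) return false;
    const std::size_t length = std::strlen(source);
    text = new char[length + 1];
    std::memcpy(text, source, length + 1);  // copies the terminator too
    return true;
}

// Takes ownership immediately so delete[] runs on every exit path.
bool text_as_string(const char* source, std::string& out) {
    char* raw = nullptr;  // initialised because the failure case is not documented
    if (!produce_text(source, raw)) return false;
    std::unique_ptr<char[]> owner(raw);  // the array form calls delete[]
    out.assign(owner.get());
    return true;
}
```

The array specialisation std::unique_ptr<char[]> matters here: the non-array form would call delete instead of delete[].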
bool doctotext::PlainTextExtractor::processFile | ( | const char * | buffer, |
size_t | size, | ||
char *& | text | ||
) |
Parses specified document and extracts plain text. Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
buffer | pointer to the file content array |
size | size of buffer |
bool doctotext::PlainTextExtractor::processFile | ( | const char * | buffer, |
size_t | size, | ||
std::string & | text | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
bool doctotext::PlainTextExtractor::processFile | ( | ParserType | parser_type, |
bool | fallback, | ||
const std::string & | file_name, | ||
std::string & | text | ||
) |
Parses specified document and extracts plain text.
parser_type | restricts the parser to the specified document format. If set to PARSER_AUTO, the parser works with all supported document formats. This argument overrides the parser type set for the object. |
fallback | if true, the parser will try to detect the document format when parsing with the format specified in the parser_type argument fails. This parameter is ignored if parser_type is set to PARSER_AUTO. |
file_name | full path to file containing document. |
text | reference to object of std::string class that will contain produced plain text. |
Returns true if the document was processed successfully, false otherwise.
bool doctotext::PlainTextExtractor::processFile | ( | ParserType | parser_type, |
bool | fallback, | ||
const char * | file_name, | ||
char *& | text | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
text | reference to a pointer that will point to the produced plain text in the form of a null-terminated array of chars. The caller is responsible for deleting the buffer with the delete[] operator. |
bool doctotext::PlainTextExtractor::processFile | ( | ParserType | parser_type, |
bool | fallback, | ||
const char * | buffer, | ||
size_t | size, | ||
char *& | text | ||
) |
Parses specified document and extracts plain text. Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
buffer | pointer to the file content array |
size | size of buffer |
bool doctotext::PlainTextExtractor::processFile | ( | ParserType | parser_type, |
bool | fallback, | ||
const char * | buffer, | ||
size_t | size, | ||
std::string & | text | ||
) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
void doctotext::PlainTextExtractor::setFormattingStyle | ( | const FormattingStyle & | style | ) |
Sets how tables, lists and URLs should be formatted in the plain text produced by the parser.
style | instance of structure FormattingStyle that specifies formatting style. |
void doctotext::PlainTextExtractor::setLogStream | ( | std::ostream & | log_stream | ) |
Assigns an output stream that will be used for logging messages and errors. It can be used to capture logs to a file or a string, or to show them in a dialog. The std::cerr stream is used by default.
log_stream | the stream that will be used for logging |
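Capturing the log in a string can be sketched with a std::ostringstream. StandInExtractor below is a hypothetical stand-in that mirrors only the setLogStream(std::ostream &) signature; its single-argument processFile and its log message are invented for the sketch. The sketch also assumes that the stream is held by reference, so it must outlive every call that may log.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in; NOT the library class. It logs one invented message.
class StandInExtractor {
public:
    void setLogStream(std::ostream& log_stream) { m_log = &log_stream; }
    bool processFile(const std::string& file_name) {
        *m_log << "cannot open " << file_name << '\n';
        return false;
    }
private:
    std::ostream* m_log = &std::cerr;  // std::cerr by default, as documented
};

// Runs one parse with the log redirected into a string.
std::string run_with_captured_log(StandInExtractor& extractor,
                                  const std::string& file_name) {
    std::ostringstream log;  // must stay alive while the extractor may log
    extractor.setLogStream(log);
    extractor.processFile(file_name);
    extractor.setLogStream(std::cerr);  // avoid keeping a dangling reference
    return log.str();
}
```

Resetting the stream before the local std::ostringstream goes out of scope is the important step; a file stream or a member stream with a longer lifetime would not need it.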
void doctotext::PlainTextExtractor::setManageXmlParser | ( | bool | manage | ) |
Enables or disables management of the libxml2 parser by the object. If it is enabled (the default), the PlainTextExtractor object calls the xmlInitParser() and xmlCleanupParser() functions automatically. All PlainTextExtractor objects use a common thread-safe counter for this purpose. This is convenient if you are not using libxml2 elsewhere in the application. If it is disabled, it is your responsibility to call xmlInitParser() and xmlCleanupParser().
manage | if true, managing will be enabled; if false, managing will be disabled. |
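The "common thread-safe counter" idea can be illustrated with a mutex-guarded reference count. This is a sketch of the general technique, not the library's implementation: ManagedUser, global_init and global_cleanup are hypothetical, the two functions stand in for xmlInitParser() and xmlCleanupParser(), and tying the calls to construction and destruction is an assumption, since the text above does not say when the library makes them.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for xmlInitParser()/xmlCleanupParser(); they only
// count how often they are called.
static int g_init_calls = 0;
static int g_cleanup_calls = 0;
void global_init() { ++g_init_calls; }
void global_cleanup() { ++g_cleanup_calls; }

class ManagedUser {
public:
    ManagedUser() {
        std::lock_guard<std::mutex> lock(guard());
        if (count()++ == 0) global_init();  // first live object initialises
    }
    ~ManagedUser() {
        std::lock_guard<std::mutex> lock(guard());
        if (--count() == 0) global_cleanup();  // last live object cleans up
    }
    ManagedUser(const ManagedUser&) = delete;
    ManagedUser& operator=(const ManagedUser&) = delete;

private:
    // Function-local statics give one shared mutex and counter per program.
    static std::mutex& guard() { static std::mutex m; return m; }
    static int& count() { static int c = 0; return c; }
};
```

The scheme breaks down when other code in the application also initialises or cleans up the same global state, which is the situation in which disabling the managed mode makes sense.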
void doctotext::PlainTextExtractor::setVerboseLogging | ( | bool | verbose | ) |
Enables or disables verbose logging. Verbose logging is disabled by default. If verbose logging is disabled only important messages and errors are logged. If verbose logging is enabled all messages and errors are logged.
verbose | if true, verbose logging will be enabled; if false, verbose logging will be disabled. |