SILVERCODERS DocToText  4.0.1512
Converts DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML and HTML documents to plain text. Extracts metadata and annotations.
doctotext::PlainTextExtractor Class Reference

#include <plain_text_extractor.h>

Public Types

enum  ParserType {
  PARSER_AUTO, PARSER_RTF, PARSER_ODF_OOXML, PARSER_XLS,
  PARSER_DOC, PARSER_PPT, PARSER_HTML, PARSER_IWORK,
  PARSER_XLSB, PARSER_PDF, PARSER_TXT, PARSER_EML,
  PARSER_ODFXML
}
 

Public Member Functions

 PlainTextExtractor (ParserType parser_type=PARSER_AUTO)
 
void setVerboseLogging (bool verbose)
 
void setLogStream (std::ostream &log_stream)
 
void setFormattingStyle (const FormattingStyle &style)
 
void setXmlParseMode (XmlParseMode mode)
 
void setManageXmlParser (bool manage)
 
ParserType parserTypeByFileExtension (const std::string &file_name)
 
ParserType parserTypeByFileExtension (const char *file_name)
 
bool parserTypeByFileContent (const std::string &file_name, ParserType &parser_type)
 
bool parserTypeByFileContent (const char *file_name, ParserType &parser_type)
 
bool parserTypeByFileContent (const char *buffer, size_t size, ParserType &parser_type)
 
bool processFile (const std::string &file_name, std::string &text)
 
bool processFile (const char *file_name, char *&text)
 
bool processFile (const char *buffer, size_t size, char *&text)
 
bool processFile (const char *buffer, size_t size, std::string &text)
 
bool processFile (ParserType parser_type, bool fallback, const std::string &file_name, std::string &text)
 
bool processFile (ParserType parser_type, bool fallback, const char *file_name, char *&text)
 
bool processFile (ParserType parser_type, bool fallback, const char *buffer, size_t size, char *&text)
 
bool processFile (ParserType parser_type, bool fallback, const char *buffer, size_t size, std::string &text)
 
bool extractMetadata (const std::string &file_name, Metadata &metadata)
 
bool extractMetadata (const char *file_name, Metadata &metadata)
 
bool extractMetadata (const char *buffer, size_t size, Metadata &metadata)
 
bool extractMetadata (ParserType parser_type, bool fallback, const std::string &file_name, Metadata &metadata)
 
bool extractMetadata (ParserType parser_type, bool fallback, const char *file_name, Metadata &metadata)
 
bool extractMetadata (ParserType parser_type, bool fallback, const char *buffer, size_t size, Metadata &metadata)
 
size_t getNumberOfLinks () const
 
void getParsedLinks (std::vector< Link > &links) const
 
void getParsedLinks (const Link *&links, size_t &number_of_links) const
 
const Link * getParsedLinks () const
 
void getAttachments (std::vector< Attachment > &attachments) const
 
void getAttachments (const Attachment *&attachments, size_t &number_of_attachments) const
 
const Attachment * getAttachments () const
 
size_t getNumberOfAttachments () const
 

Detailed Description

Extracts plain text from documents. In addition it can be used to extract metadata and comments (annotations). Example of usage (extracting plain text):

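// Default construction uses PARSER_AUTO, so the format is detected.
doctotext::PlainTextExtractor extractor;
std::string text;
if (extractor.processFile("example.doc", text))
    std::cout << text << std::endl;
else
    std::cerr << "Error." << std::endl;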

Example of usage (extracting metadata):

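// Default construction uses PARSER_AUTO, so the format is detected.
doctotext::PlainTextExtractor extractor;
Metadata meta;
if (extractor.extractMetadata("example.doc", meta))
    std::cout << meta.author << std::endl;
else
    std::cerr << "Error." << std::endl;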

Note that each instance of PlainTextExtractor should be used in a single thread only. One instance of this object cannot parse two or more files in parallel.

Member Enumeration Documentation

enum doctotext::PlainTextExtractor::ParserType

Enumerates all supported document formats. PARSER_AUTO means the format is unknown and should be determined automatically.

Constructor & Destructor Documentation

doctotext::PlainTextExtractor::PlainTextExtractor ( ParserType  parser_type = PARSER_AUTO)

The constructor.

Parameters
parser_type  restricts the parser to the specified document format. If set to PARSER_AUTO, the parser works with all supported document formats.

Member Function Documentation

bool doctotext::PlainTextExtractor::extractMetadata ( const std::string &  file_name,
Metadata &  metadata 
)

Parses specified document and extracts metadata (author, creation time, etc).

Parameters
file_name  full path to the file containing the document.
metadata  reference to an object of the Metadata class that will contain the extracted information.
Returns
true if document was processed successfully, false otherwise.
See Also
ParserType Metadata
bool doctotext::PlainTextExtractor::extractMetadata ( const char *  file_name,
Metadata &  metadata 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

bool doctotext::PlainTextExtractor::extractMetadata ( const char *  buffer,
size_t  size,
Metadata &  metadata 
)

Parses specified document and extracts metadata (author, creation time, etc). Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
buffer  pointer to the file content array
size  size of the buffer
bool doctotext::PlainTextExtractor::extractMetadata ( ParserType  parser_type,
bool  fallback,
const std::string &  file_name,
Metadata &  metadata 
)

Parses specified document and extracts metadata (author, creation time, etc).

Parameters
parser_type  restricts the parser to the specified document format. If set to PARSER_AUTO, the parser works with all supported document formats. This argument overrides the parser type set for the object.
fallback  if true, the parser tries to detect the document format when parsing with the format specified in parser_type fails. This parameter is ignored if parser_type is set to PARSER_AUTO.
file_name  full path to the file containing the document.
metadata  reference to an object of the Metadata class that will contain the extracted information.
Returns
true if document was processed successfully, false otherwise.
See Also
ParserType Metadata
bool doctotext::PlainTextExtractor::extractMetadata ( ParserType  parser_type,
bool  fallback,
const char *  file_name,
Metadata &  metadata 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

bool doctotext::PlainTextExtractor::extractMetadata ( ParserType  parser_type,
bool  fallback,
const char *  buffer,
size_t  size,
Metadata &  metadata 
)

Parses specified document and extracts metadata (author, creation time, etc). Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
buffer  pointer to the file content array
size  size of the buffer
void doctotext::PlainTextExtractor::getAttachments ( std::vector< Attachment > &  attachments) const

Gets a vector of the attachments in the last parsed file. Currently only the EML parser supports attachments. Call this method after a file has been processed.

See Also
Attachment
void doctotext::PlainTextExtractor::getAttachments ( const Attachment *&  attachments,
size_t &  number_of_attachments 
) const

Gets a table of the attachments in the last parsed file. Currently only the EML parser supports attachments. Call this method after a file has been processed. Note that the table of attachments is deleted when the PlainTextExtractor is destroyed or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
attachments  pointer to the first element in the table of Attachment objects.
number_of_attachments  number of attachments in the table.
See Also
Attachment
const Attachment* doctotext::PlainTextExtractor::getAttachments ( ) const

Gets a table of the attachments in the last parsed file. Currently only the EML parser supports attachments. Call this method after a file has been processed. Note that the table of attachments is deleted when the PlainTextExtractor is destroyed or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Returns
pointer to the first element in table of Attachment objects.
See Also
Attachment
size_t doctotext::PlainTextExtractor::getNumberOfAttachments ( ) const

Gets the number of attachments in the last parsed file. Currently only the EML parser supports attachments. Call this method after a file has been processed.

Returns
number of attachments in the parsed file.
See Also
Attachment
size_t doctotext::PlainTextExtractor::getNumberOfLinks ( ) const

Gets the number of links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML.

Returns
number of links in the parsed file.
See Also
Link
void doctotext::PlainTextExtractor::getParsedLinks ( std::vector< Link > &  links) const

Gets a vector of the links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML.

See Also
Link
void doctotext::PlainTextExtractor::getParsedLinks ( const Link *&  links,
size_t &  number_of_links 
) const

Gets a table of the links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML. Note that the table of links is deleted when the PlainTextExtractor is destroyed or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
links  pointer to the first element in the table of Link objects.
number_of_links  number of links in the table.
See Also
Link
const Link* doctotext::PlainTextExtractor::getParsedLinks ( ) const

Gets a table of the links in the last parsed file. Supported parsers: HTML/EML/ODF_OOXML/ODFXML. Note that the table of links is deleted when the PlainTextExtractor is destroyed or another file is parsed. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Returns
pointer to the first element in table of Link objects.
See Also
Link
bool doctotext::PlainTextExtractor::parserTypeByFileContent ( const std::string &  file_name,
ParserType &  parser_type 
)

Tries to determine document format by file content.

Parameters
file_name  full path to the file containing the document.
parser_type  reference to a variable of type ParserType that will receive the determined document format, or PARSER_AUTO if the document format cannot be determined.
Returns
true if document was processed successfully, false otherwise.
See Also
ParserType parserTypeByFileExtension
bool doctotext::PlainTextExtractor::parserTypeByFileContent ( const char *  file_name,
ParserType &  parser_type 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

bool doctotext::PlainTextExtractor::parserTypeByFileContent ( const char *  buffer,
size_t  size,
ParserType &  parser_type 
)

Tries to determine document format by file content. Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
buffer  pointer to the file content array
size  size of the buffer
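
Content-based detection generally relies on signatures at the start of the data. The following is a minimal sketch of that idea, not the library's actual algorithm: the SniffedFormat enumeration and the sniffFormat and startsWith functions are hypothetical names introduced for illustration, and only a few widely documented signatures are checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical format families; these are NOT doctotext's ParserType values.
enum class SniffedFormat { Unknown, Pdf, Rtf, Ole2Container, ZipContainer };

// Returns true if the buffer begins with the `len` bytes stored in `magic`.
static bool startsWith(const char* buffer, std::size_t size,
                       const char* magic, std::size_t len)
{
    return size >= len && std::memcmp(buffer, magic, len) == 0;
}

// Guesses a format family from well-known leading signatures.
SniffedFormat sniffFormat(const char* buffer, std::size_t size)
{
    assert(buffer != nullptr || size == 0);
    if (startsWith(buffer, size, "%PDF-", 5))
        return SniffedFormat::Pdf;
    if (startsWith(buffer, size, "{\\rtf", 5))
        return SniffedFormat::Rtf;
    // OLE2 compound file header, shared by legacy DOC, XLS and PPT files.
    if (startsWith(buffer, size, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8))
        return SniffedFormat::Ole2Container;
    // ZIP local file header; OOXML and ODF packages are ZIP archives.
    if (startsWith(buffer, size, "PK\x03\x04", 4))
        return SniffedFormat::ZipContainer;
    return SniffedFormat::Unknown;
}
```

A signature only identifies the container, so telling DOC from XLS, or DOCX from ODT, requires looking inside the container. That further step is omitted here.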
ParserType doctotext::PlainTextExtractor::parserTypeByFileExtension ( const std::string &  file_name)

Tries to determine document format by file name extension.

Warning
Some applications save CSV documents with "xls" extension, RTF documents with "doc" extension or HTML documents with "xls" or "doc" extension. In such a situation this simple test will fail.
Parameters
file_name  file name or full path to the file.
Returns
value of type ParserType representing the determined document format, or PARSER_AUTO if the document format cannot be determined.
See Also
ParserType parserTypeByFileContent
ParserType doctotext::PlainTextExtractor::parserTypeByFileExtension ( const char *  file_name)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
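
To illustrate why the warning above matters, here is a minimal sketch of extension-based detection. It is an assumption about how such a test could work, not the library's implementation. ParserTypeSketch and guessByExtension are hypothetical names, and only a handful of extensions are covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Local stand-in that mirrors a few names from the ParserType enumeration.
enum ParserTypeSketch { SKETCH_AUTO, SKETCH_DOC, SKETCH_XLS, SKETCH_RTF,
                        SKETCH_HTML, SKETCH_PDF };

// Maps the text after the last dot to a format; SKETCH_AUTO means "unknown".
ParserTypeSketch guessByExtension(const std::string& file_name)
{
    const std::string::size_type dot = file_name.find_last_of('.');
    if (dot == std::string::npos)
        return SKETCH_AUTO;
    std::string ext = file_name.substr(dot + 1);
    // Compare case-insensitively, so "DOC" and "doc" are treated alike.
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (ext == "doc")  return SKETCH_DOC;
    if (ext == "xls")  return SKETCH_XLS;
    if (ext == "rtf")  return SKETCH_RTF;
    if (ext == "html" || ext == "htm") return SKETCH_HTML;
    if (ext == "pdf")  return SKETCH_PDF;
    return SKETCH_AUTO;
}
```

Because only the name is inspected, an RTF document saved as "letter.doc" is reported as DOC. In that case a content-based check such as parserTypeByFileContent is the safer choice.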

bool doctotext::PlainTextExtractor::processFile ( const std::string &  file_name,
std::string &  text 
)

Parses specified document and extracts plain text.

Parameters
file_name  full path to the file containing the document.
text  reference to an object of the std::string class that will contain the produced plain text.
Returns
true if document was processed successfully, false otherwise.
See Also
ParserType setFormattingStyle
bool doctotext::PlainTextExtractor::processFile ( const char *  file_name,
char *&  text 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
text  reference to a pointer that will point to the produced plain text in the form of a null-terminated array of chars. The caller is responsible for deleting the buffer using the delete[] operator.
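
Since the caller owns the returned buffer, one way to avoid leaks is to hand it to a std::unique_ptr<char[]> immediately. The sketch below shows the pattern with a stand-in producer. produceText and produceTextOwned are hypothetical functions that only imitate the "char *& out-parameter allocated with new[]" contract described above; they are not part of the library.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for an API that returns a new[]-allocated,
// null-terminated buffer through a char*& out-parameter.
bool produceText(const char* input, char*& text)
{
    assert(input != nullptr);
    const std::size_t len = std::strlen(input);
    text = new char[len + 1];
    std::memcpy(text, input, len + 1); // copies the terminating '\0' too
    return true;
}

// Copies the result into a std::string and releases the raw buffer.
bool produceTextOwned(const char* input, std::string& out)
{
    char* raw = nullptr;
    if (!produceText(input, raw))
        return false;
    std::unique_ptr<char[]> guard(raw); // delete[] runs when guard leaves scope
    out.assign(guard.get());
    return true;
}
```

The array form of std::unique_ptr is used on purpose: it calls delete[] rather than delete, which matches the ownership rule stated for this overload.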
bool doctotext::PlainTextExtractor::processFile ( const char *  buffer,
size_t  size,
char *&  text 
)

Parses specified document and extracts plain text. Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
buffer  pointer to the file content array
size  size of the buffer
bool doctotext::PlainTextExtractor::processFile ( const char *  buffer,
size_t  size,
std::string &  text 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

bool doctotext::PlainTextExtractor::processFile ( ParserType  parser_type,
bool  fallback,
const std::string &  file_name,
std::string &  text 
)

Parses specified document and extracts plain text.

Parameters
parser_type  restricts the parser to the specified document format. If set to PARSER_AUTO, the parser works with all supported document formats. This argument overrides the parser type set for the object.
fallback  if true, the parser tries to detect the document format when parsing with the format specified in parser_type fails. This parameter is ignored if parser_type is set to PARSER_AUTO.
file_name  full path to the file containing the document.
text  reference to an object of the std::string class that will contain the produced plain text.
Returns
true if document was processed successfully, false otherwise.
See Also
ParserType setFormattingStyle
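
The interaction between parser_type and fallback can be pictured as the control flow below. This is a sketch under stated assumptions, not the library's code. FormatSketch, parseWithFallback and the two callbacks are hypothetical, and the behaviour for the AUTO case (detect first, then parse) is a guess at what "the parser will work with all supported formats" means in practice.

```cpp
#include <cassert>
#include <functional>

enum FormatSketch { FMT_AUTO, FMT_DOC, FMT_RTF };

// tryParse(format) returns true when parsing as `format` succeeds.
// detect() returns the detected format, or FMT_AUTO if detection fails.
bool parseWithFallback(FormatSketch requested, bool fallback,
                       const std::function<bool(FormatSketch)>& tryParse,
                       const std::function<FormatSketch()>& detect)
{
    if (requested == FMT_AUTO) {
        // No specific format was requested, so `fallback` has no effect.
        const FormatSketch detected = detect();
        return detected != FMT_AUTO && tryParse(detected);
    }
    if (tryParse(requested))
        return true;
    if (!fallback)
        return false;
    // The requested format failed: detect the real one and retry once.
    const FormatSketch detected = detect();
    return detected != FMT_AUTO && detected != requested && tryParse(detected);
}
```

For an RTF document saved with a "doc" extension, a call with FMT_DOC and fallback set to false fails, while the same call with fallback set to true succeeds through detection.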
bool doctotext::PlainTextExtractor::processFile ( ParserType  parser_type,
bool  fallback,
const char *  file_name,
char *&  text 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
text  reference to a pointer that will point to the produced plain text in the form of a null-terminated buffer. The caller is responsible for deleting the buffer using the delete[] operator.
bool doctotext::PlainTextExtractor::processFile ( ParserType  parser_type,
bool  fallback,
const char *  buffer,
size_t  size,
char *&  text 
)

Parses specified document and extracts plain text. Uses memory buffer instead of file. This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Parameters
buffer  pointer to the file content array
size  size of the buffer
bool doctotext::PlainTextExtractor::processFile ( ParserType  parser_type,
bool  fallback,
const char *  buffer,
size_t  size,
std::string &  text 
)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

void doctotext::PlainTextExtractor::setFormattingStyle ( const FormattingStyle &  style)

Sets how tables, lists and URLs should be formatted in the plain text produced by the parser.

Parameters
style  instance of the FormattingStyle structure that specifies the formatting style.
See Also
FormattingStyle
void doctotext::PlainTextExtractor::setLogStream ( std::ostream &  log_stream)

Assigns an output stream that will be used for logging messages and errors. It can be used to capture logs to a file or string, or to show them in a dialog. The std::cerr stream is used by default.

Parameters
log_stream  the stream that will be used for logging
See Also
setVerboseLogging
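
Capturing logs in a string comes down to passing a std::ostringstream where a std::ostream reference is expected. The sketch below demonstrates this with a stand-in class. LoggingComponent, doWork and captureLogs are hypothetical and only imitate the setLogStream contract, including the std::cerr default.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a component with a configurable log stream.
class LoggingComponent
{
public:
    LoggingComponent() : m_log(&std::cerr) {}   // std::cerr by default
    void setLogStream(std::ostream& log_stream) { m_log = &log_stream; }
    void doWork() { *m_log << "processing started" << std::endl; }
private:
    std::ostream* m_log; // not owned; the stream must outlive this object
};

// Runs the component with its logs redirected into a string.
std::string captureLogs()
{
    std::ostringstream captured;
    LoggingComponent component;
    component.setLogStream(captured);
    component.doWork();
    return captured.str();
}
```

Because the stream is taken by reference, it presumably has to stay alive for as long as the object may log to it. Passing a std::ofstream instead would send the same messages to a file.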
void doctotext::PlainTextExtractor::setManageXmlParser ( bool  manage)

Enables or disables management of the libxml2 parser by the object. If it is enabled (the default), the PlainTextExtractor object calls the xmlInitParser() and xmlCleanupParser() functions automatically. All PlainTextExtractor objects use a common thread-safe counter for this purpose. This is appropriate if you are not using libxml2 elsewhere in the application. If it is disabled, it is your responsibility to call xmlInitParser() and xmlCleanupParser().

Parameters
manage  if true, management is enabled; if false, it is disabled.
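
The "common thread-safe counter" can be understood as reference counting around a global init and cleanup pair. The following is a minimal sketch of that pattern and not the library's implementation. globalInit and globalCleanup are stand-ins that only count calls (they do not touch libxml2), ManagedLibraryUser is a hypothetical class, and the choice is made at construction here for brevity rather than through a setter.

```cpp
#include <cassert>
#include <mutex>

// Stand-ins for a library's global init/cleanup pair; they only count calls.
static int g_initCalls = 0;
static int g_cleanupCalls = 0;
static void globalInit() { ++g_initCalls; }
static void globalCleanup() { ++g_cleanupCalls; }

// The first managed user triggers init, the last one triggers cleanup.
class ManagedLibraryUser
{
public:
    explicit ManagedLibraryUser(bool manage = true) : m_manage(manage)
    {
        if (!m_manage) return;            // caller handles init/cleanup itself
        std::lock_guard<std::mutex> lock(sharedMutex());
        if (userCount()++ == 0) globalInit();
    }
    ~ManagedLibraryUser()
    {
        if (!m_manage) return;
        std::lock_guard<std::mutex> lock(sharedMutex());
        assert(userCount() > 0);
        if (--userCount() == 0) globalCleanup();
    }
    ManagedLibraryUser(const ManagedLibraryUser&) = delete;
    ManagedLibraryUser& operator=(const ManagedLibraryUser&) = delete;
private:
    static std::mutex& sharedMutex() { static std::mutex m; return m; }
    static int& userCount() { static int c = 0; return c; }
    bool m_manage;
};
```

The sketch also shows why managed mode can be a poor fit when other code uses the same global library: cleanup runs as soon as the last managed object goes away, whether or not anyone else still needs the library.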
void doctotext::PlainTextExtractor::setVerboseLogging ( bool  verbose)

Enables or disables verbose logging. Verbose logging is disabled by default. If verbose logging is disabled only important messages and errors are logged. If verbose logging is enabled all messages and errors are logged.

Warning
Verbose logging can produce a lot of text, especially if the library was compiled in debug mode.
Parameters
verbose  if true, verbose logging is enabled; if false, it is disabled.
See Also
setLogStream

The documentation for this class was generated from the following file: