SILVERCODERS DocToText
4.0.1512
Converts DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML and HTML documents to plain text. Extracts metadata and annotations.
The binaries were compiled with GCC, so using them from GCC is straightforward. We distribute everything in a single directory that contains all the necessary files: headers (.h) and libraries (.dll, .so, .dylib). All we have to do is set a few options: LDFLAGS (-L/path/to/doctotext) and CXXFLAGS (-I/path/to/doctotext). Do not forget to set LD_LIBRARY_PATH as well. It must also contain the path to doctotext, otherwise we may get undefined references. Finally, we have to pass one more option to the linker: "-ldoctotext". Now we can create an example file, main.cpp
The shortest way to build the example program is to execute this command:

    LD_LIBRARY_PATH=./doctotext g++ -o example main.cpp -I./doctotext/ -L./doctotext/ -ldoctotext

Here ./doctotext is the directory with the include and library files we distribute. Create a .doc file named example.doc and put it in the same directory as the executable. Now we can run the application:

    LD_LIBRARY_PATH=./doctotext ./example

We should see the extracted text, the author of the file, and the person who last modified it.

There is one more thing to remember: the "doctotext" directory contains a "resources" directory, which is used by the PDF parser. We have to put this directory in the same location as the executable, otherwise the PDF parser may fail.
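Because a missing "resources" directory only shows up later as a PDF parsing failure, it can help to check for it when the program starts. The following is a minimal sketch and not part of doctotext: the helper name has_resources_dir is hypothetical, and the code assumes a POSIX-style stat() (as provided by GCC and MinGW) and that the caller already knows the directory containing the executable.

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical helper (not part of doctotext): returns true when a
// directory named "resources" exists directly inside exe_dir.
bool has_resources_dir(const std::string& exe_dir)
{
    const std::string candidate = exe_dir + "/resources";
    struct stat info;
    if (stat(candidate.c_str(), &info) != 0)
        return false;              // path does not exist or is not accessible
    return S_ISDIR(info.st_mode);  // must be a directory, not a regular file
}
```

A program could call has_resources_dir(".") before processing a PDF file when it is started from its own directory, and print a clear message if the check fails.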
You can build an application that uses doctotext in a similar way as with GCC, but there is one important thing you need to know. You must not use any doctotext function that requires the Standard Template Library (STL). To be safe, do not use any object from the std namespace. The reason is that the binaries are built with GCC, whose STL implementation differs too much from the one provided by CFront. Avoiding the STL is possible, since we provide an API that does not use that library. Simply rewrite main.cpp from the previous chapter in the following way:
Note that text and file_name are of type char*, not std::string. Now everything should work as expected.
The Windows binaries have been compiled with MinGW, so they cannot easily be used in an MSVC environment. This does not apply to the C API, however: at that level, MinGW libraries are compatible with MSVC. For this reason we provide an additional file, doctotext_c_api.h. As the file name indicates, it contains a list of functions that use C naming conventions. Thanks to this API, binaries produced by MinGW can be used in other environments, such as MSVC. To compile a sample program with MSVC, we need the following files: libdoctotext.a and doctotext_c_api.h. In the compiler options, we have to add two paths: an include path (the directory where doctotext_c_api.h lies) and a library path (the directory where libdoctotext.a lies). Then, in the linker options, we have to add one file: "libdoctotext.a". That's all. We can create the main.cpp file:
Now we can compile our simple program. Do not forget to put all *.dll files into the directory where the executable is. The program should then work, provided that the example.doc file is present. There is one more thing to remember: the binaries directory contains a "resources" directory, which is used by the PDF parser. We have to put this directory in the same location as the executable, otherwise the PDF parser may fail.
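To illustrate why a C-level interface can cross the MinGW/MSVC boundary, here is a minimal, self-contained sketch of the general pattern. It does not use doctotext: DemoResult, demo_process, demo_result_text and demo_result_free are hypothetical stand-ins, not the functions declared in doctotext_c_api.h. The sketch assumes only that such an API exposes extern "C" functions, plain C types and an opaque handle.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for illustration only; these are NOT the
// functions declared in doctotext_c_api.h.
extern "C" {
    // Opaque handle: callers only hold a pointer, so the struct layout
    // never has to be agreed on between compilers.
    struct DemoResult;

    DemoResult* demo_process(const char* file_name);
    const char* demo_result_text(const DemoResult* result);
    void demo_result_free(DemoResult* result);
}

// "Library side": the definition stays hidden from callers.
struct DemoResult {
    char* text;
};

// Builds a placeholder string instead of parsing a real document.
extern "C" DemoResult* demo_process(const char* file_name)
{
    static const char prefix[] = "text of ";
    DemoResult* result =
        static_cast<DemoResult*>(std::malloc(sizeof(DemoResult)));
    if (result == nullptr)
        return nullptr;
    const std::size_t length = sizeof(prefix) - 1 + std::strlen(file_name);
    result->text = static_cast<char*>(std::malloc(length + 1));
    if (result->text == nullptr) {
        std::free(result);
        return nullptr;
    }
    std::strcpy(result->text, prefix);
    std::strcat(result->text, file_name);
    return result;
}

// Returns a pointer owned by the handle; valid until demo_result_free.
extern "C" const char* demo_result_text(const DemoResult* result)
{
    return result->text;
}

// Releases the handle with the same allocator that created it.
extern "C" void demo_result_free(DemoResult* result)
{
    if (result == nullptr)
        return;
    std::free(result->text);
    std::free(result);
}
```

In this sketch the handle is released by a function from the same library that allocated it, because memory obtained from one compiler's runtime should not be freed by another's. Whether doctotext_c_api.h follows the same ownership scheme is an assumption to check against the header itself.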