Download Rjava Package In R
Introduction
Scientific articles are typically locked away in PDF, a format designed primarily for printing and poorly suited to searching or indexing. The new pdftools package allows extracting text and metadata from PDF files in R. From the extracted plain text one can find articles discussing a particular drug or species name without having to rely on publishers providing metadata or on pay-walled search engines.
The pdftools package slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.
Installation
On Windows and Mac the binary packages can be installed directly from CRAN:
install.packages("pdftools")
Installation on Linux requires the poppler development library. On Ubuntu 16.04 (Xenial) and Ubuntu 18.04 (Bionic) we have backports that support the latest pdf_data() functionality:
sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev
On other versions of Debian or Ubuntu simply use:
sudo apt-get install libpoppler-cpp-dev
If you want to install the package from source on macOS you need brew:
brew install poppler
On Fedora:
sudo yum install poppler-cpp-devel
Building from source
On Ubuntu
Update: It is now recommended to use the backport PPA mentioned above. If you really want to build from source, follow the instructions of this askubuntu.com answer.
On CentOS
On CentOS the libpoppler-cpp library is not included with the system, so we need to build from source. Note that recent versions of poppler require C++11, which is not available on CentOS, so we build a slightly older version of libpoppler.
By default libraries get installed in /usr/local/lib and /usr/local/include. On CentOS this is not a default search path, so we need to set PKG_CONFIG_PATH and LD_LIBRARY_PATH to point R to the right directory:
export LD_LIBRARY_PATH="/usr/local/lib"
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
We can then start R and install pdftools.
Getting started
The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the PDF. Each string in the vector contains a plain text version of the text on that page.
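A minimal sketch of this workflow, assuming pdftools is installed. To stay self-contained it writes a small two-page PDF with base R graphics rather than downloading a paper; the page contents and the search term "aspirin" are invented for illustration.

```r
library(pdftools)

# Write a small two-page PDF with base R graphics (illustrative content only)
pdf_file <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_file)
plot.new()
text(0.5, 0.5, "Page one mentions aspirin")
plot.new()
text(0.5, 0.5, "Page two mentions placebo")
invisible(dev.off())

# pdf_text returns one string per page
txt <- pdf_text(pdf_file)
length(txt)

# Print the text of the first page
cat(txt[1])

# Indices of pages that mention a search term (case-insensitive)
hits <- which(grepl("aspirin", txt, ignore.case = TRUE))
hits
```

The same grepl call could be applied to the text of many articles to shortlist the ones that mention a given drug or species name.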
In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which PDF readers usually display in a menu on the left. It looks pretty in JSON:
# Table of contents
toc <- pdf_toc("1403.2805.pdf")
# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.
# Author, version, etc
info <- pdf_info("1403.2805.pdf")
# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")
Bonus feature: rendering pdf
A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.
# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")
This feature is still experimental and currently does not work on Windows.
Tables
Data scientists are often interested in data from tables. Unfortunately the PDF format is pretty dumb and has no notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with pdftools.
txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")
# some tables
cat(txt[18])
cat(txt[19])
The tabulizer package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. However, tabulizer depends on rJava and therefore requires additional setup steps, or may be impossible to use on systems where Java cannot be installed.
With some creativity it is possible to use pdftools to parse tables from PDF documents, which does not require Java to be installed.
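One way to apply that creativity, sketched under assumptions: the text returned by pdf_text typically keeps column layout as runs of spaces, so a simple table can often be recovered by splitting each line on two or more spaces. The page string below is a made-up stand-in for one element of the pdf_text result, and the helper name parse_text_table is hypothetical. Tables with empty cells or wrapped text would need more care.

```r
# Stand-in for one element of pdf_text() output (made-up data)
page <- paste(
  "Table 1: Example measurements",
  "",
  "Species          Count    Mean weight",
  "Adelie             152         3700.7",
  "Gentoo             124         5076.0",
  sep = "\n"
)

# Hypothetical helper: split a page into lines, keep the lines that have the
# expected number of columns, and build a data frame from them
parse_text_table <- function(page, n_cols) {
  lines <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  # Assumption: columns are separated by two or more spaces
  cells <- strsplit(lines, " {2,}")
  cells <- cells[lengths(cells) == n_cols]
  rows <- do.call(rbind, cells)
  # First matching line is treated as the header row
  out <- as.data.frame(rows[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(out) <- rows[1, ]
  # Convert columns that look numeric, leave the rest as character
  out[] <- lapply(out, function(col) type.convert(col, as.is = TRUE))
  out
}

tbl <- parse_text_table(page, n_cols = 3)
str(tbl)
```

Filtering on the number of columns is a deliberately crude way to skip captions and body text; when whitespace alone is ambiguous, the word positions reported by pdf_data() may be a better starting point.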
Source: https://docs.ropensci.org/pdftools/
Posted by: swiftpedestal.blogspot.com