Introduction
Not long after an API for the large language model ChatGPT was released, tools such as Hipdf followed, where one can upload a PDF and ask a GPT-powered system questions regarding the PDF's content. However, doing this manually for dozens or hundreds of documents is not feasible, as one has to upload each PDF and type out the instructions again and again.
This is why in this blog post, we look into recreating the process performed by Hipdf and subsequently automating it in R. Specifically, we will do this in three steps: first, a PDF document is converted to text; second, that text is passed to ChatGPT to generate a response; last, we automate the prior two steps to allow for batch processing of PDFs.
In order to run the R code in this blog post, you'll have to load the following four packages.
library(openai)
library(pdftools)
library(magrittr)
library(stringr)
Set up your OpenAI API
This section describes how you can set up an API for the models released by OpenAI, analogous to my previous blog post.
Before you can try this yourself, you'll have to set up an OpenAI account. The company charges for the usage of its models; however, for sporadic usage, it's not very pricey. Using gpt-3.5-turbo (a faster version of ChatGPT) currently costs 0.002 USD per 1000 generated tokens1, whereas images created by DALL-E 2 go for two cents each.
1 In the context of large language models like GPT-3, a token refers to a unit of text that the model processes during both the training and inference stage. Tokens can be individual characters, words, or subwords, depending on the tokenization method used. In the case of OpenAI’s tokenization, one word in English equals about 1.33 tokens on average.
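With that average ratio, we can make a rough back-of-the-envelope estimate of token counts and cost from a word count. The helper functions below are made up for this illustration – actual token counts depend on the tokenizer used by the model.

```r
# Rough token and cost estimate based on the ~1.33 tokens-per-word ratio above.
# `estimateTokens` and `estimateCost` are hypothetical helpers, not real tokenizers.
estimateTokens <- function(text) {
  words <- length(strsplit(trimws(text), "\\s+")[[1]])
  ceiling(words * 1.33)
}
estimateCost <- function(text, usdPer1kTokens = 0.002) {
  estimateTokens(text) * usdPer1kTokens / 1000
}
estimateTokens("Attention is all you need")  # 5 words, about 7 tokens
```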
The wrapper for the OpenAI API is called openai and can be downloaded directly from CRAN (1). In order to use the API, you'll have to provide an API key. For this, sign up for the OpenAI API and generate your personal secret API key on this site (select Personal, then View API keys in the drop-down menu). You can then copy the key by clicking on the green text Copy.
The secret key will be used in every function that calls a model via the API; alternatively, you can set it as an environment variable so that you only have to pass the key once.
Sys.setenv(OPENAI_API_KEY = "<SECRET KEY>")
Make sure to replace <SECRET KEY> with your actual secret API key. Once this is done, you will be able to use OpenAI's models directly from within R, and you won't have to pass the API key to any functions from the openai package.
Summarizing a single PDF
As an example, we’ll have a look at the famous paper “Attention is all you need” written by some Google researchers (2). The publication introduces the transformer, a neural network architecture that constitutes the foundation for recent advances in natural language processing and the subsequent development of OpenAI’s GPT series.
Text extraction
First, we'll save the URL of the publicly available PDF article in a variable called Vaswani2017.

Vaswani2017 <- "https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"
Next, we will write a function pdf2text which – as the name suggests – converts a PDF to raw text.
pdf2text <- function(path, brackets = TRUE, references = TRUE){
  text <- pdftools::pdf_text(path) %>%
    paste0(collapse = " ") %>%
    stringr::str_squish()
  if(brackets) {
    target <- "\\[.*?\\]"
    text <- gsub(target, "", text)
  }
  if(references) {
    target <- "\\bReferences|references|REFERENCES\\b"
    text <- strsplit(text, target) %>%
      unlist %>%
      head(., -1) %>%
      paste(., collapse = " ")
  }
  return(text)
}
1. Use the R package pdftools to read in a PDF from a specified path and get rid of the tab stops and white spaces.
2. Optionally cut out all content in brackets – which in academic papers is expected to be citation references.
3. Optionally cut out all content following the word references – case-insensitively. Note that in case the target word occurs more than once in the text, only the part after its last occurrence will be removed.
In addition to converting the PDF article to text, we also remove the references – both within the text as well as at the end of the document.
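To see what the two cleaning steps inside pdf2text do, here is a minimal demonstration of both regular expressions on a toy string (the example sentence is made up):

```r
text <- "The transformer [1] builds on attention [2, 3]. References [1] Vaswani et al."
# Step 2: remove bracketed citation markers
gsub("\\[.*?\\]", "", text)
# Step 3: split at "References" and drop everything after the last occurrence
paste(head(unlist(strsplit(text, "\\bReferences|references|REFERENCES\\b")), -1),
      collapse = " ")
```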
Sending the prompt to ChatGPT
Next, we define the function getPrompt, which glues together the extracted text and the specific instruction we want to give to ChatGPT.

getPrompt <- function(text, instruction){
  paste0(text, "\n\n", instruction)
}
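For instance, with a made-up snippet of extracted text and an instruction, the assembled prompt looks like this:

```r
# getPrompt restated here so the example is self-contained
getPrompt <- function(text, instruction) paste0(text, "\n\n", instruction)
cat(getPrompt("The transformer is a neural network architecture.",
              "Please summarize this text in one sentence."))
```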
Finally, we define a function getResponse, which uses OpenAI's API to send a prompt to ChatGPT and return its answer. Specifically, we will use the large language model named gpt-3.5-turbo-16k-0613. This is a fast version of ChatGPT with a massively expanded context window, which means the model can process much larger texts at once – enough to read a standard-size academic paper in seconds!
getResponse <- function(prompt){
  openai::create_chat_completion(
    model = "gpt-3.5-turbo-16k-0613",
    messages = list(list(role = "user",
                         content = prompt))
  )$choices$message.content
}
Now, we can put everything together into a single function called pdf2response.

pdf2response <- function(path, instruction){
  text <- pdf2text(path)
  prompt <- getPrompt(text, instruction)
  getResponse(prompt)
}
This function allows us to generate a summary of a PDF with minimal code.
pdf2response(Vaswani2017, "Please summarize this text in five sentences.")
Since the GPT-series models generate text stochastically, running the above function will produce a different response each time. However, what the model writes about the paper seems accurate.
The paper presents a new model for sequence transduction tasks called the Transformer. Unlike traditional models that rely on recurrent or convolutional neural networks, the Transformer model is based solely on attention mechanisms and does not require sequential computation. The model uses self-attention to generate representations of the input and output sequences and connects the encoder and decoder through attention mechanisms. The Transformer model achieves state-of-the-art results on machine translation tasks, outperforming previous models and ensembles. The model is more parallelizable, faster to train, and allows for learning long-range dependencies in sequences.
Automating the workflow
At last, we want to write a simple for-loop that repeats the same instruction for multiple PDF files. The aim should be to summarize all the papers in a folder with the name "<PATH>"
while ignoring other files in the folder. Thus, we first write the function getPDFs, which returns the names of all files in the folder that are in fact PDFs.

getPDFs <- function(path) dir(path)[grep("\\.pdf$", dir(path))]
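As a quick sanity check, we can try getPDFs on a freshly created temporary folder (the file names here are made up):

```r
# getPDFs restated here so the example is self-contained
getPDFs <- function(path) dir(path)[grep("\\.pdf$", dir(path))]
d <- file.path(tempdir(), "pdfs")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("Vaswani2017.pdf", "notes.txt")))
getPDFs(d)  # only "Vaswani2017.pdf" is returned
```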
With that being done, we can summarize all the PDF files in the given folder with the following code.
<- "Please summarize this text in five sentences."
instruction
sink("summary.txt")
for (i in getPDFs(<PATH>)) {
cat("\n\n#",gsub(".pdf", "", i))
<- pdf2response(file.path(<PATH>, i), instruction)
summary cat("\n\n",summary)
}sink()
The code will automatically create a text file named "summary.txt"
in your working directory which contains all the individual PDF summaries.
Conclusion
In this blog post, we explored the process of automatically summarizing PDF documents using OpenAI's ChatGPT model with the 16k-token context window. We wrote R functions that convert a PDF to text and send that text to ChatGPT to generate a summary, and then automated the process to summarize multiple PDFs in one go. By combining PDF text extraction with ChatGPT's language generation capabilities, we can quickly generate summaries for large numbers of documents.
However, one limitation of the code presented in this blog post is that it does not work for extremely long PDFs – namely if the prompt surpasses 16’000 tokens, i.e. about 12’000 words in English. In such cases, the text may need to be split into smaller parts and summarized individually before combining the summaries into a final summary. However, this is a task for another post.
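A rough sketch of how such splitting could look (not a complete solution – splitText is a made-up name) is to divide the text into chunks of at most a fixed number of words, so each chunk fits within the context window:

```r
# Minimal sketch: split a long text into chunks of at most `size` words each.
splitText <- function(text, size = 10000) {
  words <- unlist(strsplit(text, "\\s+"))
  n <- ceiling(length(words) / size)
  sapply(seq_len(n), function(i) {
    paste(words[((i - 1) * size + 1):min(i * size, length(words))],
          collapse = " ")
  })
}
long <- paste(rep("word", 25), collapse = " ")
length(splitText(long, size = 10))  # 3 chunks
```

Each chunk could then be passed to getResponse individually, and the partial summaries combined and summarized once more.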
Overall, automating the process of summarizing PDFs has the potential to save time and effort, making it easier to extract information from large numbers of documents.
References
1. Rudnytskyi I. openai: R wrapper for OpenAI API. CRAN.
2. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017).
Citation
@online{oswald2023,
author = {Oswald, Damian},
title = {Building a {PDF} Summary System Based on {ChatGPT}},
date = {2023-06-30},
url = {damianoswald.com/blog/pdf-chat-gpt},
langid = {en},
abstract = {In this blog post, we explored the process of automating
the extraction and summarization of content from PDF documents using
OpenAI’s ChatGPT model. We developed a workflow in R that converts
PDFs to text, sends the text to ChatGPT for summarization, and then
automated this process for batch processing of multiple PDFs. This
approach can save time and effort when dealing with a large number
of documents and can be easily customized for different instructions
and datasets.}
}