An example of nested downloads using RCurl. RCurl
An example of nested downloads using RCurl.

Overview

This example uses RCurl to download an HTML document and then follow each relative link within that document. The goal is to be able to start the download and pass this to the xmlTreeParsexmlEventParse functions for processing. As that parser needs more input, it calls fetches data from the HTTP response stream.

To do this, we need to use the <b>multi</b> interface for libcurl. The idea is quite simple. We initiate the download and associate a "writer" to slurp up the body of the HTTP response. This is registered with libcurl and is invoked whenever libcurl is in control and is processing the HTTP response. If there is information to be read on the HTTP stream from the server, this function reads it and appends it to a variable pending. The second part of this is that we need a function that is called by xmlEventParse which can provide input the XML parser. Of course, it will use the content coming from the HTTP server that is collected in the function getHTTPResponse. So we create a sibling function that shares the state of the getHTTPResponse function and so can see the contents of the variable pending. When the XML parser demands some input, our function supplyXMLContent checks to see if pending has non-trivial content (i.e. is not the empty string). If it has some content, it returns that. Otherwise, it tells libcurl to read some more from the HTTP stream. When it hands control to libcurl in this way, libcurl will invoke our getHTTPResponse function, populating the contents of pending. So when libcurl yields control, we will now have content to pass to the XML parser.

The only additional issue that we have to deal with in this setup is that the XML event parser asks for input up to a certain size. We cannot necessarily give it all of the content of pending. If pending has more characters than the XML parser wants, we must give it the first maxLen characters and then leave the remainder in pending for the next request from the XML parser.

The following generator function defines the two functions that do the pulling of the text from libcurl and the pushing to the XML parser.

HTTPReaderXMLParser =
function(curl, verbose = FALSE, save = FALSE)
{
   pending =  ""
   text = character()

   getHTTPResponse = 
   function(txt) {

      pending <<- paste(pending, txt, sep = "")

      if(save)
        text <<- c(text, txt)

      if(verbose) {
        cat("Getting more information from HTTP response\n")
        print(pending)
      }

      ""  # Give back something real.
   }

  supplyXMLContent = 
   function(maxLen) {
      if(verbose)
        cat("Getting data for XML parser\n")


     if(pending == "") {

         if(verbose)
            cat("Need to fetch more data for XML parser from HTTP response\n")

         while(pending == "") {
            status = curlMultiPerform(curl, multiple = TRUE)
            if(status[2] == 0) 
               break
         } 
     }

     if(pending == "") {
         # There is no more input available from this request.
       return(character())
     }


      # Now, we have the text, and we return at most maxLen - 1
      # characters
     if(nchar(pending) >= maxLen) {
        ans = substring(pending, 1, maxLen-1)
        pending <<- substring(pending, maxLen)
     } else {
        ans = pending
        pending <<- ""
     }

     if(verbose)
        cat("Sending '", ans, "' to XML\n", sep = "")

     ans
   }

   list(getHTTPResponse = getHTTPResponse,
        supplyXMLContent = supplyXMLContent,
        pending = function() pending,
        text = function() paste(text, collapse = "")
       )
}


The remaining part is how we combine these pieces with RCurl and the XML packages to do the parsing in this asynchronous, interleaved manner. The code below performs the basic steps


uri = "http://www.omegahat.org/RCurl/philosophy.xml"
uri = "http://www.omegahat.org/RDoc/overview.xml"

handle = getCurlMultiHandle()
streams = HTTPReaderXMLParser(handle)
handle = getURLAsynchronous(uri, 
                           write = streams$getHTTPResponse,
                           multiHandle = handle, perform = FALSE)

links = getDocbookLinks() 
xmlEventParse(streams$supplyXMLContent, handlers = links, saxVersion = 2)
links$links()


We create a 'multi handle'. This gives us the asynchronous behavior that returns control back to us from libcurl rather than sending the request and slurping back all the data in one single action. Next, we create our functions to do the pulling and pushing of text from HTTP to the XML parser. This is done with the call to getURLAsynchronous. Note that we tell it not to actually perform the request. We are just setting it up to be done when the XML parser requests input. This is important as this call must return so that we can call xmlEventParse. [1] The next step is to establish the XML event parser. We provide a collection of handlers that process the XML content in the way that we want (see below). And now we are off, and the XML parser will request input and the functions will read from the HTTP stream.

To process the links within the Docbook document, we are looking for each ulink element and fetching its url attribute. So we can provide a collection of handlers that consist of a function only for ulink. And it need only look at the attributes it is given and determine if there is a url entry. If there is, it appends the value to its internal collection of links. When we are finished the parsing, we can ask for this collection of links using the additional function links.

getDocbookLinks =
function()
{
 links = character()

 ulink = function(name, attrs, ns, namespaces) {
    if("url" %in% names(attrs))
      links[length(links) + 1 ] <<- attrs["url"]
 }

 list(ulink = ulink,
      links = function() links)
}


To run this code, we need to load both the RCurl and XML packages.
library(RCurl)
library(XML)

Test

This is a test that the basic asynchronous mechanism works generally. In this example, we just provide a reader that displays the text from the HTTP response on the console. We create a 'multi handle' and setup the HTTP request with our specialized reader. Then, we call curlMultiPerform to start the ball rolling. This will force one or more invocations of the function f. We can continue to loop until there is no more data available in the 'multi handle' by comparing the number of elements in the stack.