getURL("http://www.omegahat.org/RCurl/index.html")

The idea is that we specify the URI of interest. There are several other arguments to this function but, for the most part, we do not need them. We can also use HTTPS to fetch URIs securely. For example,
getURL("https://sourceforge.net")

This is already more than we can do with the regular connections or the built-in download.file in R. (Using an external program allows HTTPS access.) There are three different sets of arguments for the getURL function. One is named curl, and we will cover it in the section called "CURL Handles". It is merely a way to make a sequence of requests on a single connection with shared options. The write argument is again rather specialized. It allows us to specify an R function that is called each time libcurl has some text as part of the HTTP response. libcurl hands this text (as a sequence of bytes) to the function so that it can process it in whatever way it deems fit. This corresponds to the writefunction option for the libcurl operation described next. We make it an explicit argument simply because the default behavior of getURL uses it to collect the response and return it in a single action. The third set of arguments is the most general and is handled by the ... in the getURL function. With this, one can specify name-value pairs governing the actual request. There are numerous possible settings. The basic idea is that one can set any option provided by the curl_easy_setopt routine. These allow us to set parameters for many different aspects of the request. For example, we can specify additional headers for the HTTP request, or include a password for the Web site. The set of possible options can be determined via the function getCurlOptionsConstants, and the set of names for the different options can be found via the command
names(getCurlOptionsConstants())

This is a collection of names of options that are understood by many of the functions in the RCurl package. At present, there are 113 possible options.
sort(names(getCurlOptionsConstants()))
  [1] "autoreferer"              "buffersize"
  [3] "cainfo"                   "capath"
  [5] "closepolicy"              "connecttimeout"
  [7] "cookie"                   "cookiefile"
  [9] "cookiejar"                "cookiesession"
 [11] "crlf"                     "customrequest"
 [13] "debugdata"                "debugfunction"
 [15] "dns.cache.timeout"        "dns.use.global.cache"
 [17] "egdsocket"                "encoding"
 [19] "errorbuffer"              "failonerror"
 [21] "file"                     "filetime"
 [23] "followlocation"           "forbid.reuse"
 [25] "fresh.connect"            "ftp.create.missing.dirs"
 [27] "ftp.response.timeout"     "ftp.ssl"
 [29] "ftp.use.eprt"             "ftp.use.epsv"
 [31] "ftpappend"                "ftplistonly"
 [33] "ftpport"                  "header"
 [35] "headerfunction"           "http.version"
 [37] "http200aliases"           "httpauth"
 [39] "httpget"                  "httpheader"
 [41] "httppost"                 "httpproxytunnel"
 [43] "infile"                   "infilesize"
 [45] "infilesize.large"         "interface"
 [47] "ipresolve"                "krb4level"
 [49] "low.speed.limit"          "low.speed.time"
 [51] "maxconnects"              "maxfilesize"
 [53] "maxfilesize.large"        "maxredirs"
 [55] "netrc"                    "netrc.file"
 [57] "nobody"                   "noprogress"
 [59] "nosignal"                 "port"
 [61] "post"                     "postfields"
 [63] "postfieldsize"            "postfieldsize.large"
 [65] "postquote"                "prequote"
 [67] "private"                  "progressdata"
 [69] "progressfunction"         "proxy"
 [71] "proxyauth"                "proxyport"
 [73] "proxytype"                "proxyuserpwd"
 [75] "put"                      "quote"
 [77] "random.file"              "range"
 [79] "readfunction"             "referer"
 [81] "resume.from"              "resume.from.large"
 [83] "share"                    "ssl.cipher.list"
 [85] "ssl.ctx.data"             "ssl.ctx.function"
 [87] "ssl.verifyhost"           "ssl.verifypeer"
 [89] "sslcert"                  "sslcertpasswd"
 [91] "sslcerttype"              "sslengine"
 [93] "sslengine.default"        "sslkey"
 [95] "sslkeypasswd"             "sslkeytype"
 [97] "sslversion"               "stderr"
 [99] "tcp.nodelay"              "telnetoptions"
[101] "timecondition"            "timeout"
[103] "timevalue"                "transfertext"
[105] "unrestricted.auth"        "upload"
[107] "url"                      "useragent"
[109] "userpwd"                  "verbose"
[111] "writefunction"            "writeheader"
[113] "writeinfo"

Each of these options and what it controls is described in the libcurl man(ual) page for curl_easy_setopt, and that is the authoritative documentation.
Anything we provide here is merely repetition or additional explanation. The names of the options require a slight explanation. They correspond to symbolic names in the C code of libcurl. For example, the option url in R corresponds to CURLOPT_URL in C. Firstly, uppercase letters are annoying to type and read, so we have mapped them to lowercase letters in R. We have also removed the prefix "CURLOPT_", since we know the context in which the option names are being used. And lastly, any option names that contain a '_' (after we have removed the CURLOPT_ prefix) have the '_' replaced with a '.' so we can type them in R without having to quote them. For example, combining these three rules, CURLOPT_URL becomes url and CURLOPT_NETRC_FILE becomes netrc.file. That is the mapping scheme. The code that handles options in RCurl automatically maps the user's inputs to lowercase, which means you can use any mixture of case that makes your code more readable to you and others. For example, we might write
writeFunction = basicTextGatherer()
or
HTTPHeader = c(Accept="text/html")
We specify one or more options by using these names. To make interactive use easier, we perform partial matching of the names against the set of known names. So, for example, we could specify
getURL("http://www.omegahat.org/RCurl/testPassword", verbose = TRUE)

or, more succinctly,
getURL("http://www.omegahat.org/RCurl/testPassword", v = TRUE)

Obviously, the first is more readable and less ambiguous. Please use the full form when writing software; the abbreviated form is acceptable when working interactively. Each option expects a certain type of value from R. For example, the following options expect a number or logical value.
 [1] "autoreferer"              "buffersize"
 [3] "closepolicy"              "connecttimeout"
 [5] "cookiesession"            "crlf"
 [7] "dns.cache.timeout"        "dns.use.global.cache"
 [9] "failonerror"              "followlocation"
[11] "forbid.reuse"             "fresh.connect"
[13] "ftp.create.missing.dirs"  "ftp.response.timeout"
[15] "ftp.ssl"                  "ftp.use.eprt"
[17] "ftp.use.epsv"             "ftpappend"
[19] "ftplistonly"              "header"
[21] "http.version"             "httpauth"
[23] "httpget"                  "httpproxytunnel"
[25] "infilesize"               "ipresolve"
[27] "low.speed.limit"          "low.speed.time"
[29] "maxconnects"              "maxfilesize"
[31] "maxredirs"                "netrc"
[33] "nobody"                   "noprogress"
[35] "nosignal"                 "port"
[37] "post"                     "postfieldsize"
[39] "proxyauth"                "proxyport"
[41] "proxytype"                "put"
[43] "resume.from"              "ssl.verifyhost"
[45] "ssl.verifypeer"           "sslengine.default"
[47] "sslversion"               "tcp.nodelay"
[49] "timecondition"            "timeout"
[51] "timevalue"                "transfertext"
[53] "unrestricted.auth"        "upload"
[55] "verbose"

The connecttimeout option gives the maximum number of seconds the connection may take before an error is raised, so this is a number. The header option, on the other hand, is merely a flag indicating whether header information from the response should be included, so it can be a logical value (or a number, with 0 meaning FALSE and non-zero meaning TRUE). At present, all numbers passed from R are converted to long when used in libcurl. Many options are specified as strings. For example, we can specify the user password for a URI as
getURL("http://www.omegahat.org/RCurl/testPassword/index.html",
       userpwd = "bob:duncantl", verbose = TRUE)

Note that we also turned on the verbose option so that we can see what libcurl is doing. This is extremely convenient when trying to understand why things aren't working (or are working in a particular way!). Another example of using strings is to specify a referer URI and a user-agent.
getURL("http://www.omegahat.org/RCurl/index.html",
       useragent = "RCurl", referer = "http://www.omegahat.org")

(Again, you might want to turn on the verbose option to see what libcurl does with this information.) The libcurl facilities allow us not only to set our own values for fields used in the HTTP request header (such as the referer or user-agent), but also to add an entire collection of new fields or replacements for any existing field. We do this in R using the httpheader option for libcurl, giving a named character vector as its value. For example, suppose we want to provide a value for the Accept field and add a new field named, say, Made-up-field. We could do this in the request as
getURL("http://www.omegahat.org/RCurl",
       httpheader = c(Accept = "text/html", 'Made-up-field' = "bob"))

If you turn on the verbose option again for this request, you will see these fields being set.
> getURL("http://www.omegahat.org",
+        httpheader = c(Accept = "text/html", 'Made-up-field' = "bob"),
+        verbose = TRUE)
* About to connect() to www.omegahat.org port 80
* Connected to www.omegahat.org (169.237.46.32) port 80
> GET / HTTP/1.1
Host: www.omegahat.org
Pragma: no-cache
Accept: text/html
Made-up-field: bob

(Note that not all servers will tolerate arbitrarily set header fields and may return an error.) The key thing to note is that the headers are specified as name-value pairs in a character vector. R takes these, pastes each name and value together, and passes the resulting character vector to libcurl. So while it is convenient to express the headers as
c(name = "value", name = "value")

if you already have the data in the form
c("name: value", "name: value")

you can use that directly. Some of the libcurl options expect a C routine. For example, when libcurl is receiving the response from the HTTP server, it calls the C routine specified via the option CURLOPT_WRITEFUNCTION each time it has a full buffer of bytes. While it would be possible to specify a C routine from R (using getNativeSymbolInfo), we currently don't support this. Instead, it is more natural to specify an R function to be called when appropriate, and this is indeed how we do things in RCurl. One can specify a function for the writefunction, writeheader, and debugfunction options. (We can add support for the others, such as readfunction.) Using these is quite simple. We expect an R function that takes a single argument, the characters (bytes) to process. The function can do what it wants with this argument. Typically, it will accumulate the text in a persistent variable (e.g. using closures) or process it on-the-fly, for example adding it to a plot or passing it to an HTML parser. The function basicTextGatherer is an example of the idea, and this mechanism is used in getURL. Suppose, for some reason, we wanted to read the header information returned by the HTTP server in the response to our request. (This contains interesting things like cookies, the content type, etc. that libcurl uses internally but that we may also want to process.) We would first use the header option to turn on the libcurl facility that reports the response header information. If we do just this, the header information will be included in the text that getURL returns. This is fine, but we then have to separate it out by finding the first line, and so on. Instead, it is easier to ask libcurl to hand the header information to us separately from the text/body of the response. We can do this by creating a callback function via the basicTextGatherer function.
h = basicTextGatherer()
txt = getURL("http://www.omegahat.org/RCurl", header = TRUE, headerfunction = h$update)

All we have done is create a collection of functions (stored in h) and pass the update callback to libcurl. Each time libcurl receives more of the header, it calls this function with the header text. It may call the function just once or several times; this depends on how large the header information is, how libcurl buffers the information, and so on. Having called getURL, we have the text from the URI. The header information is available from h, specifically from its value function element.
h$value()

The debugGatherer is another example of a callback that can be used with libcurl. If we set the verbose option to
TRUE, libcurl will provide a lot of information about its actions. By default, this output is written to the console (i.e. stderr). In some cases, we would rather not have it on the screen but instead, for example, displayed in a GUI or stored in a variable for closer examination. We can do this by providing a callback function for the debugging output via the debugfunction option for libcurl.
The debugGatherer is a simple one that merely accumulates its inputs in different categories and makes them available via the value function. The setup is easy:
d = debugGatherer()
x = getURL("http://www.omegahat.org/RCurl", debugfunction = d$update, verbose = TRUE)

At the end of the request, we again have the text from the URI in x, but we also have the debugging information. libcurl has called our update function each time it had some information (either from the HTTP server or from its own internal dialog).
names(d$value())
[1] "text"      "headerIn"  "headerOut" "dataIn"    "dataOut"

The headerIn and headerOut fields report the text of the header of the response from the Web server and of our request, respectively. Similarly, the dataIn and dataOut fields give the bodies of the response and the request. The text element is just messages from libcurl. We should note that not all options are (currently) meaningful in R. For example, it is not currently possible to redirect standard error for libcurl to a different FILE* via the stderr option. (In the future, we may be able to specify an R function for writing errors from libcurl, but we have not implemented that yet.)
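The gatherer functions above are ordinary R closures, so we can write our own callback when we want different behavior. The following sketch (makeGatherer is a name of our own, not part of RCurl) accumulates the chunks the callback receives and counts the invocations; in a real request we would pass g$update as the writefunction option.

```r
# A hand-rolled text gatherer: 'chunks' and 'calls' persist across
# invocations because update() is a closure over them.
makeGatherer = function() {
  chunks = character()
  calls = 0L
  update = function(txt) {
    chunks[[length(chunks) + 1L]] <<- txt   # keep this chunk
    calls <<- calls + 1L
    nchar(txt)   # report how many characters we consumed
  }
  list(update = update,
       value = function() paste(chunks, collapse = ""),
       calls = function() calls)
}

g = makeGatherer()
# With RCurl we would use:  getURL("http://www.omegahat.org/RCurl", writefunction = g$update)
# Here we simulate libcurl invoking the callback twice:
g$update("<html>")
g$update("</html>")
g$value()    # "<html></html>"
```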
http://www.omegahat.org/cgi-bin/form.pl?a=1&b=2
getForm("http://www.google.com/search",
        hl = "en", lr = "", ie = "ISO-8859-1", q = "RCurl", btnG = "Search")

The result is the HTML you would ordinarily see in your browser; you might use htmlTreeParse to parse it. What is important in the example is that we specify the required fields of the query as named arguments in R. getForm takes care of bringing them together and constructing the full URI. Note that libcurl also handles escaping the special characters, e.g. converting a space to %20. If you want to perform this escaping explicitly on a string rather than having libcurl do it implicitly, you can use curlEscape; similarly, the function curlUnescape reverses the escaping to make a string "human-readable". postForm is almost identical. Let's submit a POST form to http://www.speakeasy.org/~cgires/perl_form.cgi
postForm("http://www.speakeasy.org/~cgires/perl_form.cgi",
         "some_text" = "Duncan",
         "choice" = "Ho",
         "radbut" = "eep",
         "box" = "box1, box2")

Here, the form elements are named some_text, choice, radbut, and box; we have simply provided values for them. Again, the result is the regular response from the HTTP server. Sometimes we already have the arguments in a list, and it is slightly more complex then to pass them to the function via the ... argument. The two form submission functions in RCurl (getForm and postForm) accept the name-value arguments via the ... parameter; this arises in programmatic access to the functions rather than interactive use. Since we use ... for the name-value pairs of the form, we cannot specify the libcurl options unambiguously in this way, and we require that any such options controlling the HTTP request at the libcurl level be passed via the .opts parameter. RCurl and libcurl construct the HTTP request, and after that the request is just like a regular URI download. All of the usual techniques for reading the response, its header, etc. work.
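When the form values already live in a named list, one way to splice them into getForm along with libcurl options is base R's do.call. This is a sketch: the params and opts variables are our own names, and the query fields are taken from the Google example above.

```r
library(RCurl)

# Form fields that happen to be in a list, e.g. built programmatically.
params = list(hl = "en", q = "RCurl")

# libcurl-level options must go through .opts, not through ... .
opts = curlOptions(verbose = TRUE)

# do.call() spreads the list elements as individual named ... arguments,
# so each becomes a name=value field of the query.
txt = do.call(getForm,
              c(list(uri = "http://www.google.com/search", .opts = opts),
                params))
```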
handle = getCurlHandle()
a = getURL("http://www.omegahat.org/RCurl", curl = handle)
b = getURL("http://www.omegahat.org/", curl = handle)

It is important to remember that if we set any options in any of the calls, they are set in the libcurl handle and persist across requests unless they are reset. For example, if we had set the header = TRUE option in the first call above, it would remain set for the second call. This can sometimes be inconvenient. In such cases, either use separate libcurl handles or reset the options.
The function dupCurlHandle allows us to create a new libcurl handle that is an exact copy of an existing one. This lets us quickly reuse existing settings without having them affect other requests. (The data in the option values are not copied.) See curl_easy_duphandle.
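A sketch of this pattern: configure one handle with the common options, then duplicate it for a request that needs extra settings, leaving the original untouched. (The variable names base and h are ours.)

```r
library(RCurl)

# Common settings, configured once.
base = getCurlHandle(useragent = "RCurl", followlocation = TRUE)

# An exact copy of 'base'; options set through it do not leak back.
h = dupCurlHandle(base)
txt = getURL("http://www.omegahat.org/RCurl", curl = h, header = TRUE)

# Requests made with 'base' still have header reporting turned off.
```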
By reusing libcurl handles, we avoid allocating a new one for each request and potentially benefit from improved connectivity, e.g. reuse of open connections. One downside of reusing handles, however, is that the options we set in R need to be copied as C data, since they persist in the libcurl handle itself across R function calls. As a result, some additional computation is needed. Again, this is negligible in almost all cases and will be dominated by the network speed.
libcurl doesn't have any explicit function for fetching a
URL. Instead, it uses a powerful but simple interface which involves
merely setting the options in the libcurl handle as desired and then
invoking the request. So one just prepares the request and forces it
to be sent. This is done via the curlPerform
function in R. This is how getURL is actually
implemented.
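As an illustration, here is a simplified sketch of how a getURL-style function can be assembled from curlPerform and a text gatherer. The real getURL does considerably more (error handling, encodings, etc.), and myGetURL is a name of our own.

```r
library(RCurl)

myGetURL = function(url, ..., curl = getCurlHandle()) {
  h = basicTextGatherer()
  # Prepare the request by setting options (the URL and the write
  # callback are themselves just options), then force it to be sent.
  curlPerform(url = url, writefunction = h$update, curl = curl, ...)
  h$value()   # the accumulated text of the response body
}

txt = myGetURL("http://www.omegahat.org/RCurl")
```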
h = getCurlHandle()
getURL("http://www.omegahat.org", curl = h)
names(getCurlInfo(h))

The names of the resulting elements are
 [1] "effective.url"            "response.code"
 [3] "total.time"               "namelookup.time"
 [5] "connect.time"             "pretransfer.time"
 [7] "size.upload"              "size.download"
 [9] "speed.download"           "speed.upload"
[11] "header.size"              "request.size"
[13] "ssl.verifyresult"         "filetime"
[15] "content.length.download"  "content.length.upload"
[17] "starttransfer.time"       "content.type"
[19] "redirect.time"            "redirect.count"
[21] "private"                  "http.connectcode"
[23] "httpauth.avail"           "proxyauth.avail"

These give us the actual name of the URI downloaded (after redirections), information about the transfer and its speed, and so on. See the man page for curl_easy_getinfo.
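For example, after a request that may have been redirected, we might inspect a few of these fields. A sketch (the actual values depend on the server):

```r
library(RCurl)

h = getCurlHandle(followlocation = TRUE)
txt = getURL("http://www.omegahat.org", curl = h)

info = getCurlInfo(h)
info$effective.url   # the URI actually fetched, after any redirections
info$response.code   # the HTTP status, e.g. 200
info$total.time      # total transfer time in seconds
```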
$age
[1] 2

$version
[1] "7.12.0"

$version_num
[1] 461824

$host
[1] "powerpc-apple-darwin7.4.0"

$features
     ipv6       ssl      libz      ntlm largefile
        1         4         8        16       512

$ssl_version
[1] " OpenSSL/0.9.7b"

$ssl_version_num
[1] 9465903

$libz_version
[1] "1.2.1"

$protocols
[1] "ftp"    "gopher" "telnet" "dict"   "ldap"   "http"   "file"   "https"
[9] "ftps"

$ares
[1] ""

$ares_num
[1] 0

$libidn
[1] ""

The help page for the R function explains the fields, which are hopefully clear from the names. The only ones that might be obscure are ares and libidn. ares refers to asynchronous domain name server (DNS) lookup for resolving the IP address (e.g. 128.41.12.2) corresponding to a machine name (e.g. www.omegahat.org). "GNU Libidn is an implementation of the Stringprep, Punycode and IDNA specifications defined by the IETF Internationalized Domain Names (IDN)" (taken from http://www.gnu.org/software/libidn/).
 none   ssl win32   all
    0     1     2     3
attr(,"class")
[1] "CurlGlobalBits" "BitIndicator"

We would call curlGlobalInit as
curlGlobalInit(c("ssl", "win32"))

or
curlGlobalInit(c("ssl"))

to activate both SSL and Win32 sockets, or just SSL, respectively. One can specify integer values directly, but this is less readable to others (or to yourself in a few weeks!). The names are converted and combined into a single flag using setBitIndicators.
opts = curlOptions(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", verbose = TRUE, .opts = opts)

Here we create the options ahead of time and use them in a call while also specifying additional options (i.e. verbose). Some readers will have noticed that we could achieve the same effect, a set of fixed options shared by a collection of calls, by reusing a libcurl handle: create the handle, set the common options, and then supply that handle in each call. This is indeed a natural and often good way to do things. The following code does what we want.
h = getCurlHandle(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", verbose = TRUE, curl = h)

The first line creates a new handle and fills in the three "persistent" options. At this stage, these live in the handle itself, not in R. Now, when we perform the request via getURL, we specify this libcurl handle and provide the verbose option. The function curlSetOpt is used implicitly in the code above; it is what actually sets the option values in a libcurl handle. It can also be used simply to resolve them.
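curlSetOpt can also be called directly, for example to change the options on an existing handle between requests. A sketch:

```r
library(RCurl)

h = getCurlHandle(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)

# Later: keep the credentials but turn off header reporting and
# enable verbose output for subsequent requests on this handle.
curlSetOpt(header = FALSE, verbose = TRUE, curl = h)
txt = getURL("http://www.omegahat.org/RCurl/testPassword/index.html", curl = h)
```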
POST /hibye.cgi HTTP/1.1
Connection: close
Accept: text/xml
Accept: multipart/*
Host: services.soaplite.com
User-Agent: SOAP::Lite/Perl/0.55
Content-Length: 450
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://www.soaplite.com/Demo#hi"
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/1999/XMLSchema"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance">
<SOAP-ENV:Body>
<namesp1:hi xmlns:namesp1="http://www.soaplite.com/Demo"/>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Accept: text/xml
Accept: multipart/*
SOAPAction: "http://www.soaplite.com/Demo#hi"
Content-Type: text/xml; charset=utf-8
body = '<?xml version="1.0" encoding="UTF-8"?>\
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" \
  xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" \
  xmlns:xsd="http://www.w3.org/1999/XMLSchema" \
  xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" \
  xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance">\
  <SOAP-ENV:Body>\
    <namesp1:hi xmlns:namesp1="http://www.soaplite.com/Demo"/>\
  </SOAP-ENV:Body>\
</SOAP-ENV:Envelope>\n'

curlPerform(url = "http://services.soaplite.com/hibye.cgi",
            httpheader = c(Accept = "text/xml", Accept = "multipart/*",
                           SOAPAction = '"http://www.soaplite.com/Demo#hi"',
                           'Content-Type' = "text/xml; charset=utf-8"),
            postfields = body,
            verbose = TRUE)

Note that this is similar to calling getURL; we have used it here to illustrate how curlPerform can be used directly. The only difference is that the result is printed to the console rather than returned to us as a character vector. This is a problem when we really want to process the response. For that, we would simply replace the call to curlPerform with getURL:
getURL(url = "http://services.soaplite.com/hibye.cgi",
       httpheader = c(Accept = "text/xml", Accept = "multipart/*",
                      SOAPAction = '"http://www.soaplite.com/Demo#hi"',
                      'Content-Type' = "text/xml; charset=utf-8"),
       postfields = body,
       verbose = TRUE)