Distributed Matter - Blog


Friday, December 8 2006

gsiftp URI madness

Updated 21/08/2007: Added workaround
Updated 02/08/2008: Moved the workaround to the top

The workaround

One way to get consistent gsiftp URIs with both globus-url-copy and the CoG kit is to use // for absolute paths and /~/ for paths relative to the home directory; such URIs should work with both clients. What a URI with just a single leading slash points to still depends on which client you use, so avoid those if you can.
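
For what it's worth, here is a minimal sketch of that convention as a small Java helper. The class and method names are made up for illustration; it only builds the URI strings according to the // and /~/ rules described above.

    // Hypothetical helper illustrating the workaround convention:
    // "//" before absolute paths and "/~/" before paths relative to the home directory.
    public class GsiftpUris {

        /** gsiftp URI for an absolute path, e.g. /home/username/testfile. */
        public static String absolute(String host, String absolutePath) {
            String path = absolutePath.startsWith("/") ? absolutePath : "/" + absolutePath;
            // Yields gsiftp://host//home/username/testfile
            return "gsiftp://" + host + "/" + path;
        }

        /** gsiftp URI for a path relative to the home directory, e.g. testfile. */
        public static String relativeToHome(String host, String relativePath) {
            // Yields gsiftp://host/~/testfile
            return "gsiftp://" + host + "/~/" + relativePath;
        }

        public static void main(String[] args) {
            System.out.println(absolute("host", "/home/username/testfile"));
            System.out.println(relativeToHome("host", "testfile"));
        }
    }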

The problem

Globus's GridFTP has become the GGF standard for transferring files in a Grid environment. It is mainly an extension of FTP that is able to use GSI (Grid Security Infrastructure) authentication.

When using protocols such as FTP or HTTP, it is quite natural to use the URI (URL) to refer to a file. Even when FTP is considered separately from the Web (i.e. even if clicking on an FTP URL in a web browser didn't work), the concept of a URI helps a lot to address files. Similarly, I'd like my applications to be able to keep track of the files stored on GridFTP servers using URIs. There is some Globus tool support for using GridFTP URIs (prefixed with gsiftp://), in particular in globus-url-copy (which is a generic tool to copy a file from one URL to another URL) and in the Java CoG kit (which provides a Java implementation of much of the Globus Toolkit, and even more).
Sadly, using gsiftp URIs is just not possible.

Not only are gsiftp:// URIs not formally defined in the GGF standard[1] (and only barely in the globus-url-copy documentation), but there are no fewer than 3 ways of interpreting the same URI!

The default globus-url-copy format (gsiftp://host/absolute-path/file).

In this case, the path refers to the absolute path on the server. A URI to a file in the home directory ($HOME/testfile) can be written like this:

  • gsiftp://host/home/username/testfile

(provided that whoever uses it knows that $HOME is /home/username), or

  • gsiftp://host/~/testfile

The main problem is that it is counter-intuitive when one has the FTP URI format in mind.

The RFC 1738 way (similar to FTP URIs), using globus-url-copy -rp.

The standard for FTP URIs says that the path should be relative to the initial directory in which the FTP server places the client after login. For example, ftp://host/path1/path2 should perform the equivalent of cd path1 followed by cd path2. The vast majority of FTP servers set this default location to the home directory. The same RFC says that if you want an absolute path, the first / (root) should be encoded as %2f. Therefore, using the -rp option of globus-url-copy, the following two URIs should refer to the same file:

  • gsiftp://host/testfile, and
  • gsiftp://host/%2fhome/username/testfile

The Java CoG format.

This one is also relative, like the RFC 1738 format, but uses // (two slashes) instead of /%2f to designate the root directory. For example: gsiftp://host//home/username/testfile
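
To make the difference concrete, here is a small sketch (plain Java, not using any Globus or CoG API) of how the same URI path maps to a remote file under each of the three interpretations, assuming the login directory is the user's home directory. The method names and the $HOME value are invented for the example.

    import java.net.URI;

    public class GsiftpInterpretations {

        // 1. Default globus-url-copy: the URI path is the absolute path ("/~/" allowed).
        static String defaultGlobusUrlCopy(URI uri, String home) {
            String p = uri.getPath();
            return p.startsWith("/~") ? home + p.substring(2) : p;
        }

        // 2. RFC 1738 / globus-url-copy -rp: relative to the login directory, %2f = root.
        static String rfc1738(URI uri, String home) {
            String raw = uri.getRawPath().substring(1);   // drop the leading "/"
            return raw.toLowerCase().startsWith("%2f")
                    ? "/" + raw.substring(3)              // absolute path
                    : home + "/" + raw;                   // relative to $HOME
        }

        // 3. Java CoG: relative to the login directory, "//" designates the root.
        static String javaCog(URI uri, String home) {
            String p = uri.getPath();
            return p.startsWith("//") ? p.substring(1) : home + p;
        }

        public static void main(String[] args) {
            URI uri = URI.create("gsiftp://host/testfile");
            String home = "/home/username";
            System.out.println(defaultGlobusUrlCopy(uri, home)); // /testfile
            System.out.println(rfc1738(uri, home));              // /home/username/testfile
            System.out.println(javaCog(uri, home));               // /home/username/testfile
        }
    }

The same URI points to /testfile in one case and to /home/username/testfile in the other two, which is exactly the kind of ambiguity discussed below.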

Conclusions

The problems really start if you use both URIs that have an absolute path and URIs that have a relative path. For absolute paths, formats 1 and 3 behave more or less in the same manner (at least, URIs written using // work with both 1 and 3); for relative paths, formats 2 and 3 behave in the same manner, but differently from format 1. Since some of our files are at absolute path locations and others at relative path locations, and since we'd like our application to use partly globus-url-copy and partly the Java API (of the Java CoG kit), using gsiftp:// URIs becomes a bit tricky...

Having 3 possible interpretations of a given URI rather defeats the point of an identifier. This is just unusable.
What I find shocking is that these three different interpretations have actually been produced within a single project: Globus. If Grid interoperability is not achieved within a single project, how can it ever work across several of them?

Note

[1] The GGF standard mentions URLs that could be presented to a server, but the context of use is not quite clear.

Thursday, December 7 2006

AstroGrid workshop

Earlier this week, I went to the AstroGrid workshop, in Oxford. AstroGrid aims to provide a Virtual Observatory (VO). In particular, it makes it possible to put together catalogues, images and various data obtained from surveys, enabling astrophysicists to do e-astronomy (not sure if that e-word actually exists). Some of the astronomy bits were beyond my understanding, but the workbench happens to work and to be clear enough for someone who's not an expert in both computers and astronomy. Matching radio sources (obtained from radio-telescopes) with pictures (obtained from optical telescopes) appears to be easy. In fact, it's quite fun. It's also possible to incorporate the results in a variety of other astronomy tools in a few clicks. From the technical point of view, this Grid tool relies on Java WebStart for launching the application. This eases the deployment and requires little installation from the client point of view (apart from the Java Runtime Environment, obviously).

Like most pieces of software, it's not perfect, but at least this one works; it even works well. The development team is also very keen to improve it and we had good discussions.

I also managed to raise some concerns about the ivo:// URI scheme, and I'm glad my comments were welcome. I think we'll have constructive discussions again around that type of issue. It seems ivo:// is not really used for protocol purposes, but is mostly a means to identify resources in the IVO MySpace. This sounds very much like the LSID-related discussions about URNs, Namespaces and Registries. This document is definitely worth reading (although, to be honest, it took me a few discussions with Mark to understand it -- I still reckon new schemes are justified sometimes, when the protocol used is completely different). Anyway, the workshop was good, we had interesting discussions, and we'll continue these discussions for sure.

Wednesday, November 22 2006

Experiences with WSRF

About a year ago, I was new to the world of Web Services (WS-*). I wasn't new to the world of the Web, although I admit I hadn't read much literature about it (such as Architecture of the World Wide Web, Volume One or Roy Fielding's PhD dissertation). Being involved in Grid Computing, it seemed natural to spend some time learning what WS-*, SOAP and the Web Services Resource Framework (WSRF) were.

I wanted to develop a system where I could represent tasks and queues of tasks independently of the actual location on which these tasks would be allocated and executed. Had I not been in a WS-* world, I would have used what I already knew: I would have served some XML (or perhaps even some plain text) dynamically over an HTTP server, probably using a combination of Apache+PHP+MySQL. But then I thought: "That's what I would have done back when I used to program at home, based on various tutorials found in Linux magazines. I'm a grown-up now; I should investigate this fancy WS-* stuff that promises wonders".

If one reads articles and books about Web Services, they sound really good. Having mechanisms that are both flexible and interoperable (including language-independent, which I've always liked) sounds great. I bought "Web Services Platform Architecture" by Sanjiva Weerawarana et al. It's a good book and it gives a good overview of the whole thing. Then, I started to play with WSRF (using WSRF::Lite, but also experimenting with Apache WSRF).

WSRF::Lite was really good for developing a quick prototype of my application. The mapping of HTTP GET to GetResourcePropertyDocument (possible since WSRF::Lite uses URI-only EPRs) was great for debugging as well, since it was easy to see what the properties were, using just a web browser. Since interoperability was what I was really after, I started developing clients in C (using gSOAP) and Java (using Axis' wsdl2java). Writing the right WSDL that would work with both turned out to be a bit of a pain, but it worked (Wireshark was a very useful tool for this). Most of this fiddling was about namespace issues and rpc vs. document style in WSDL/SOAP.

Benefits of WS tooling support

The pro-WS talks I was reading and hearing were praising the tool support that Web Services had, but what were the benefits, really? As a developer, the main problem I face when designing an application is finding a programming model. On the one hand, what I was saying earlier about simply using Apache+PHP+MySQL+XML would have left me with the task of writing the clients manually. On the other hand, tools like gSOAP and wsdl2java can provide you with stubs quite easily.

However, the resulting programming model is very much that of distributed objects (there's not that much difference between the way you program a stub-based WSDL client and the way you do it using RMI or CORBA), except that WS-Resources are distributed objects that have the magical property of going through firewalls. WSRF operations such as GetResourceProperty and their stub mappings return some XML infoset. The type systems of C and Java make it useful (and often necessary) to map such XML infosets to types known by the client applications. Both gSOAP and wsdl2java generate type structures from the XML schemas defined in a WSDL document, but it is still up to the programmer to handle the mapping of those XML infosets more or less manually. Fair enough, I don't argue with that. However, this leads me to wonder what the point of these tools is. Thinking back to what I knew before entering the world of WS-*, it doesn't really bring anything compared with what I would have done using something like XmlBeans on the client side.
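
As a rough illustration of what I mean, this is more or less what reading such an XML document directly with XmlBeans looks like on the client side, without any generated stubs. The URL and the shape of the document are invented for the example.

    import java.net.URL;
    import org.apache.xmlbeans.XmlCursor;
    import org.apache.xmlbeans.XmlObject;

    public class PlainXmlClient {
        public static void main(String[] args) throws Exception {
            // Plain HTTP GET of an XML document (no SOAP envelope, no stubs).
            URL url = new URL("http://example.org/tasks/42");
            XmlObject doc = XmlObject.Factory.parse(url);

            // Walk the document with a cursor rather than generated types.
            XmlCursor cursor = doc.newCursor();
            cursor.toFirstChild();                 // the root element, e.g. <task>
            System.out.println(cursor.getName());  // element QName
            System.out.println(cursor.getTextValue());
            cursor.dispose();
        }
    }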

Benefits of SOAP over HTTP+XML

"With SOAP, you can access Web services through loosely coupled infrastructure that provides significant resilience, scalability, and flexibility in deployment using different implementation technologies and network transports" (quote from S. Weerawarana's book). That sounds cool! I could use SOAP to have the same message abstraction over HTTP and e-mail, for example?

In practice, the vast majority of the examples I've seen were using SOAP over HTTP. I can understand the motivation for leaving the door open to protocols other than HTTP. It might indeed not be the best for all applications. But would SOAP really help for what HTTP cannot handle well? I doubt it. SOAP over HTTP uses HTTP POST exclusively, thus preventing the use of many cool features of HTTP, in particular caching. SOAP in an e-mail? Why not, but that's going to be a big e-mail. I was able to achieve the same thing about 10 years ago via HTTP-to-e-mail gateways. Same principle, except that the overhead of putting an HTTP header in an e-mail is probably less than that of putting in the whole SOAP packaging.

Benefits of interoperability and standards

Interop. Really? Later on in the project, Perl's total disrespect for the typing information related to remote entities and, more importantly, SOAP::Lite's complicated and often inconsistent handling of complex XML types made me want to trade my WSRF implementation for another. WSRF had become a 'standard' by then, so it shouldn't have been a problem. Ideally, I could have reused the WSDL for my WS-Resources in Apache WSRF (or any other implementation for that matter). Sadly, it turned out that WSRF::Lite was using an rpc/literal style whereas Apache WSRF was using a document/literal style. This meant that I couldn't simply rewrite the server: I had to rewrite the clients as well, since the stubs were now different too. All the time I spent fixing my WSDL: wasted!

Conclusions

I think the last straw was the move to WS-RT. I spent all that time reading the WSRF specs and experimenting with various WSRF implementations that were not really interoperable (by interoperable I mean being able to swap one implementation for another, which is what I wanted to do). The thing becomes a standard, and it's made obsolete just a few months later.

Fed up with all this, I decided to drop WSRF and to use a solution based on serving XML over HTTP. I took the time to read some of the REST literature as well. I developed a similar system using Servlets and XmlBeans in about 1/10th of the time spent on WSRF. The only features of WSRF that have been lost are the resource lifetime and the fault handling. Not a great loss, as they were not that well defined or useful anyway. What I've learnt is that, using technologies that were available 10 years ago (plus some more recent tools like XmlBeans), I was able to build a stable system, better than what I had using WSRF. This HTTP+XML based system is even more robust than the WSRF one; it makes use of the idempotent properties of HTTP PUT, for example, making it possible to recover from a lost message.
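
To give an idea of the idempotent PUT part (this is only a sketch, not the actual code of my system; the class name and the in-memory storage are invented), the handler boils down to storing the representation under the request path, so a client that never saw the response can safely send the same PUT again:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class TaskServlet extends HttpServlet {

        // In-memory store keyed by request URI (a real system would persist this).
        private final Map<String, String> store = new ConcurrentHashMap<String, String>();

        protected void doPut(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // Read the XML representation sent by the client.
            StringBuilder body = new StringBuilder();
            BufferedReader reader = req.getReader();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            // PUT is idempotent: storing the same representation twice under the same
            // URI has the same effect as storing it once, so the client can retry
            // the request if the response was lost.
            store.put(req.getRequestURI(), body.toString());
            resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
        }

        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            String xml = store.get(req.getRequestURI());
            if (xml == null) {
                resp.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            resp.setContentType("application/xml");
            resp.getWriter().write(xml);
        }
    }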

Before learning about these WS-* technologies, I was sceptical about their added value. Now I know. WSRF is just not worth it. I'd like to see a good use-case for using SOAP at all one day.