Computing as a theoretical discipline

Note: my three contributions are put together here in one, separated by a bar.

As it turned out, my treatment of "images" follows quite logically from my treatment of "information processing", so I have put these two together.

==================================================================

Defining Humanities Computing methodology

Throughout the development of computing, fads have come and gone; the way in which the Humanities reacted to "computing", "computer science", "information technology" or whatever else the most popular term was at a specific time has changed along with them. This dependency on trends, as reflected by popular media, has not always been healthy. The discussions about artificial intelligence in the later eighties are a good example. Then, for a short time, it was almost impossible not to at least speak about the necessity of including an expert system in a project. This has been unhealthy not so much because very little ever came of that fad, but because it has rather fundamentally discredited the notion of expert systems being relevant for the Humanities at all. Therefore, even considering the notion currently seems almost as unwise politically as it seemed unavoidable then.

The following section tries to clarify what we can actually say about the relationship between computer science and the Humanities that remains valid while fads change. We add one more restriction: Traditionally, discussions of that type easily become unfocused, because there are three types of relationships between a Humanities scholar and computing technology, which, to the detriment of them all, are frequently intermingled. Computers can be used to gain scientific knowledge, to teach that knowledge and to publish it. These three sets of activities are of course related; but the challenges they pose and the problems they have to solve are quite fundamentally different.

In the following, we intentionally restrict ourselves to the first of the three: We are dealing with methods, that is, with the canon needed to increase the agreed upon knowledge within an academic field. And we restrict ourselves to those which can profit from the use of computational equipment or concepts. As this invariably requires that a question can be posed in such a way that a formalism for it exists, we speak about "formal methods". This is an intentional restriction of the field of discussion. We do not, e.g., discuss how to use a computer to teach a traditional subject, nor how to produce books more cheaply. (Though at some stage the reader will find a discussion of how far the new media make information available at such a scale that the methods to cope with it have to change.)

A final restriction: Information Technology as such is changing the world in which we live in many ways. The arts on the one hand and the social sciences on the other are very much geared to the reflection and interpretation of that world. They therefore have to tackle IT as they tackle other changes of the society in which we live. The Humanities, in our understanding, are different, however, from the production of art and from the interpretation of societal changes. So neither the artistic nor the sociological implications of a new generation of media are our topic.

While we restrict our topic in this way, in another we would like to see it as broadly as possible. The formalisms needed to apply some of the tools for handling language which have been developed in computational linguistics are part of our agenda; as are the prerequisites for applying the canon of quantitative methods; as are the considerations which have to go into the transfer of knowledge from a document into a data base; as are the assumptions that go into the application of a GIS to a Humanities topic. These are formalisms which, from our point of view, however, exist independently of the Humanities. GIS are not our topic, nor databases, nor quantitative methods, nor, indeed, computational linguistics. Our subject is their application to the knowledge domain of the Humanities, to improve the possibilities for research in the latter.

Talking to Ralph Griswold, the developer of SNOBOL, one of the early programming languages dedicated to the processing of strings and, therefore, textual objects, one of the authors of this chapter once listened to the following story: "You know, computer science is anything but a homogeneous field. A short time ago I had a European visitor. Talking about various matters he said at one stage: 'Being a professor of computer science, I sincerely hope that nobody will ask me to get close to a keyboard again.' Having done programming for most of my life," Ralph Griswold continued, "I did feel offended."

"Computer science" is a very wide ranging field, going from one extreme, where it becomes almost indistinguishable from mathematics, to another, where it is equally hard to tell the difference between it and electrical engineering. This, of course, describes the genealogy of the field: the Turing machine would be an interesting mathematical construct even if it had no relationship to anything ever built out of a material more solid than ideas. Transistors changed many aspects of our daily life long before computers entered it in any but exotic positions.

Having widely different ancestors itself, computer science in turn became parent to a very mixed crowd of offspring: Disciplines like medical computer science and juridical computer science have sprung up abundantly in recent years. Some of them, like the "forestry research computer science" (Forstliche Biometrie und Informatik) for which a German university recently accepted a Habilitation, will probably continue to raise eyebrows for some time to come. Others, notably computational linguistics, have established themselves as independent areas of research and self-contained academic disciplines quite beyond dispute.

The existence of this wide variety of disciplines, related to or spun off from computer science in general, implies two things. (a) In computer science itself, hybrid as it is, there must be a core of methods, which are independent from their origins (therefore we do not speak about medical mathematics). (b) For the application of this methodological core a thorough understanding of the knowledge domain to which it is applied is necessary (otherwise the concept of a medical computer science would not make sense).

As in many other cases, what does not constitute this "self contained, but application related" core is more easily specified, than what does. Pure and clean engineering topics are not part of it - though, of course, the construction of sensors in the bio-sciences may require knowledge, which the construction of sensors in thermal physics does not. The mathematical hard core should also be independent from the disciplines to which it is applied - though, of course, there are fields where fuzzy systems and their backing theory are more central than within others.

Leaving aside these subtle shades, for the purpose of a short introduction we define the core of computer science, which is more than the sum of its intellectual ancestors but still requires an intimate knowledge of the domain to which it is applied, as follows.

The methods needed to represent the information within a specific domain of knowledge in such a way, that this information can be processed by computational equipment. (The study of the data structures required by a specific discipline.)
The methods needed to formulate the research questions and specific procedures of a given domain of knowledge in such a way, that the attempt to answer them gains from the application of computational techniques. (The study of the algorithms applicable to a given discipline.)

This may, at first look, seem to be a highly abstract definition, which has few practical consequences, particularly if compared to what is actually going on in Humanities Computing.

As to practical consequences: Surprisingly, the preceding paragraphs lead to a few conclusions which may explain why a very large number of attempts at introducing university courses in some branch of Humanities Computing have failed over the years.

If we accept the assumption that the way in which the general core of computational methods, in the sense above, is used depends on the domain of knowledge to which it is applied, we also have to accept that applying computational methods without an understanding of that domain leads to disaster. In more practical terms: A German university in the early eighties introduced a study programme called Informatik für die Geisteswissenschaften, which required more course credits for numerical analysis than a computer science master's degree at many other universities. The same programme did not require the students to work on a single project that asked them to apply their knowledge to a topic of the Humanities. After spectacular student interest in the first year, the programme had to be stopped in the second, as no students were willing to take it any more.

It is pointless to teach computer science to Humanities scholars or students, when it is not directly related to their domain of expertise.

On the other hand, time and again, skills in computing are mistaken by Humanities scholars for a qualification in computer science. A good case in point is the plethora of word processing courses which sprang up at American universities in the early days of the PC, again in the eighties. Most of these collapsed within a few years, as the students discovered that it was ultimately more convenient to learn the content of such courses at their own pace, based on general manuals and introductions.

Humanities computing which is not based on an understanding of what computer science is all about is a transient phenomenon, fluctuating wildly with the fads of fashion.

If these conclusions drawn from the initial statements do not seem sufficiently practical to the reader, we ask her or him to remain patient for one more consideration before we turn the observations into recommendations. How are the definitions above related to what is actually going on at European universities?

We propose to divide the teaching and research that can be observed at the various Humanities related institutes and faculties into three groups.

A very large number of courses at Europe's universities are dedicated to the provision of basic computational skills for Humanities students. These will usually be geared towards specific disciplinary needs: A student of Russian needs to know how to write, display and print Cyrillic. As long as they are related to skills only, they do not influence the way in which scientific results are gained. At this level we are simply talking about the application of tools.
A much smaller number of courses - and a substantial number of research projects - use computationally based methods (like data base technology) or computationally dependent ones (like statistics) to gain scientific results which could not be gained without the tools employed. At this level, therefore, we talk about the application of methods.
A small number of courses and projects, finally, deal with the study of computational methods themselves, aiming at their improved understanding, without directly claiming to gain a new insight into the discipline. They are involved with the development of methods.

For readability's sake, we will refer to these levels in the following paragraphs as the Humanities Computer Literacy, the Humanities Computing and the Humanities Computer Science levels respectively.

For all practical purposes, most public discussions have been focusing on the Humanities Computer Literacy level. This is most unfortunate, as it is exactly here that requirements change most frequently. And it is the low mean life expectancy of such courses which creates the feeling that no progress is being made. On a semi-joking level: The decision of a German university in the eighties to accept a course "Computer Science for German Studies: WordStar 2000" did not only damage the credibility of the Humanities in the computer science department of that university; the short half-life of such application packages also implies a very short usefulness of such courses. Less humorous: A department of Computing for the Humanities was founded in the eighties to provide computer literacy for each student of the arts faculty. Not too far into the nineties, at least one of the departments of that faculty threatened to train its students in independent courses if the curriculum was not revised to meet current needs. And recently this department has been closed down, as the arts faculty considered it without value for its students.

Taking an elitist position, one might wonder whether it is the task of a university to teach basic computer literacy at all. Students never got academic credit for typewriting skills before the invention of word processing; why should they get it for word processing skills now? Before being accused of being overly elitist, however, we would like to point to two important differences.

The more visible one: Typewriting was a skill that remained stable between finishing secondary school and gaining a doctorate. The modern information technologies have a habit of changing so rapidly that what is almost arcane knowledge at the start of a student's first term can easily have turned into basic computer literacy by the time of her or his graduation as master, let alone PhD. If we take the notion of lifelong learning seriously, we might therefore claim that computer literacy should indeed be something the arts faculties should be concerned about: not for its own sake, but to train students in updating their own knowledge - and to impress the constant necessity of doing so upon them.

Less visible: While new techniques like the usage of word processing, spreadsheets, simple data bases and most recently web authoring have rapidly turned from advanced knowledge into survival skills, one can master them completely and thoroughly - and still be helpless when applying them to a Humanities discipline. Even today many people who use word processors routinely will find it challenging to include Cyrillic characters in their texts. A person can routinely submit his tax returns with the help of a spreadsheet and still despair of doing meaningful computations with a medieval taxation list. A student can have a brilliant homepage but still be unable to encode a literary text in such a way that it remains useful beyond the lifetime of his current full text retrieval package. Even computer literacy, therefore, has to be taught in the Humanities by concentrating on the specific problems posed by the disciplines. Word processing for literary disciplines has to concentrate on peculiarities of the specific languages of editorial styles; quantitative packages have to be taught to historians in a way that prepares them for a world of non-decimal numbers; markup for text-based disciplines has to look to general principles, not the peculiarities of a specific generation of browsers.
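To make the remark about non-decimal numbers concrete, the following minimal sketch in Python adds amounts recorded in pounds, shillings and pence (12 pence to the shilling, 20 shillings to the pound). The class Amount, the function total and the three sample entries are our own illustrative assumptions and are not taken from any actual source; real taxation lists will often use other units and locally varying conversion rates.

```python
from dataclasses import dataclass

# Conversion rates of the pre-decimal pound; other currencies need other rates.
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20


@dataclass(frozen=True)
class Amount:
    """A sum of money in pounds, shillings and pence (hypothetical helper)."""
    pounds: int = 0
    shillings: int = 0
    pence: int = 0

    def to_pence(self) -> int:
        # Reduce everything to the smallest unit before doing arithmetic.
        return ((self.pounds * SHILLINGS_PER_POUND + self.shillings)
                * PENCE_PER_SHILLING + self.pence)

    @classmethod
    def from_pence(cls, total_pence: int) -> "Amount":
        # Carry upwards: pence into shillings, shillings into pounds.
        shillings, pence = divmod(total_pence, PENCE_PER_SHILLING)
        pounds, shillings = divmod(shillings, SHILLINGS_PER_POUND)
        return cls(pounds, shillings, pence)


def total(amounts):
    """Add a list of amounts without ever treating them as decimal numbers."""
    return Amount.from_pence(sum(a.to_pence() for a in amounts))


# Invented sample entries: 13s 4d, 1 pound 6s 8d, 19s 11d.
entries = [Amount(0, 13, 4), Amount(1, 6, 8), Amount(0, 19, 11)]
result = total(entries)
print(result)  # Amount(pounds=2, shillings=19, pence=11)
```

A spreadsheet that stores such entries as decimal fractions would give a different and meaningless sum, which is the kind of problem the paragraph above has in mind.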

To fulfil both requirements, Humanities Computer Literacy should be taught to Humanities students only if two prerequisites can be taken for granted: (a) It is taught by teachers who themselves are fully trained in Humanities Computing. (b) There is no fixed canon of skills, but it is understood that precisely the courses at the most introductory level have to be revised year by year to keep them at the shifting edge between what students can be expected to learn by themselves and what they cannot.

In a nutshell: Nobody should teach computing skills to Humanities students without having experience in computer supported Humanities research, preferably in a subject close to the one from which the student population of the course is recruited. Exceptions always exist; but among the (many) conferences on one angle of Humanities computing or another taking place every year, there are few where the difficulties of communication between "pure" technicians and content-interested Humanities students are not described as a severe problem.

Humanities Computing, the second of our three levels, then constitutes the sum of all existing methods which can enhance the scientific validity of results in research or enable the pursuit of research strategies which otherwise would not be possible. It starts with methods adapted from other fields of study - for example the canon of analytical statistics, which has been developed for various fields. To apply this canon to authorship studies, the traditional sampling techniques have to be augmented in specific ways. It continues with methods which originated in other fields but have developed into completely independent approaches in specific Humanities disciplines. In art history, e.g., thesaurus based systems were originally adapted from other disciplines, have taken on a life of their own and started a discussion on the proper way to describe the content of images which has no clear equivalent in other fields. And finally, there are computational methods which developed more or less within a field of the Humanities, independent of other disciplines. An example is the long and rich tradition of methods and techniques for the identification of individuals in historical documents, though their names may vary by orthography, variable subsetting of name sets, property based name shifts and other causes.
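As a hedged illustration of the last example, the following Python sketch groups name spellings that differ only in orthography by reducing each spelling to a crude comparison key. It covers orthographic variation only, not the subsetting of name sets or property based name shifts mentioned above. The rule list RULES, the functions name_key and group_names, and the sample names are hypothetical; projects in the tradition described here derive their rules from the sources themselves and combine them with much more evidence than the name alone.

```python
import unicodedata
from collections import defaultdict

# Hypothetical substitution rules, applied in this order; they are assumptions
# made for illustration, not an established standard.
RULES = [("ck", "k"), ("ph", "f"), ("th", "t"), ("dt", "t"), ("y", "i"), ("ie", "i")]


def name_key(name: str) -> str:
    """Reduce a spelling to a crude key so that orthographic variants coincide."""
    # Lower-case, then strip diacritics and anything that is not a letter.
    decomposed = unicodedata.normalize("NFKD", name.lower())
    letters = "".join(
        c for c in decomposed if c.isalpha() and not unicodedata.combining(c)
    )
    for old, new in RULES:
        letters = letters.replace(old, new)
    # Collapse doubled letters ("tt" becomes "t").
    collapsed = []
    for c in letters:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(collapsed)


def group_names(names):
    """Group spellings that share a key; each group is only a candidate match."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


# Invented sample spellings.
print(group_names(["Meyer", "Meier", "Mayer", "Matthias", "Mathias"]))
```

Spellings that end up in one group are merely candidates for referring to the same person; whether they do remains a question for the researcher.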

Humanities Computing is a field which is most clearly in need of being stabilised institutionally. The tradition of the field is incredibly long. Many of the questions discussed today about the best way of entering Humanities information into a computer in a form it can handle can already be found in the conference volume of the Wartenstein conference in 1962, which seems to have been the first attempt at surveying the field. One of Humanities Computing's major problems is that it has a tradition of which few of its followers are aware. It is highly significant in that context that today a fresh wave of discussions about whether such a field exists has been ignited by two widely popularised WWW papers of Willard McCarty, in which the author simply assumes that he can totally ignore a tradition of forty years and start from scratch.

This lack of perception is most unfortunate for the individual researcher, as it usually means that newcomers to the field have to rediscover many solutions which have long been well known. It is even more unfortunate for the Humanities as a whole, as it means that methodological advancement proceeds much more slowly than it could. In most European countries, Humanities Computing describes a specific stage in the life of a scholar. The vast majority of practitioners are in the stage of their PhD thesis or in the years immediately after that. And, in the current system, in most European countries they face a crucial decision after working actively in the field for about five years. Either they become computer specialists, which means that they leave academia for industry, or they fall back upon more traditional paths in their home disciplines, as permanent positions for Humanities Computing specialists rarely exist.

As long as we stay with our original definition, that Humanities Computing is the application of computational tools for the benefit of the various Humanities disciplines, there is nothing wrong with this situation. Still, it means that many researchers all over Europe are constantly re-discovering some of the basics of Humanities Computing, while few, if any, possibilities exist to hand their discoveries on. To solve that situation, we propose that, just as we asked Humanities Computer Literacy to be taught by people with a Humanities Computing background, Humanities Computing should in turn be taught by Humanities Computer Science specialists: persons, that is, who make the study and development of the possibilities of computer applications in the Humanities their profession. With a solid background in one or more Humanities fields they understand the problems of these disciplines; with a strong background in computer science in general, they are able to contribute to the development of data structures and algorithms as defined initially.

This field of Humanities Computer Science has to be European from the very start. The field itself profits from the strongest possible emphasis on internationalisation: like any other new discipline, it is otherwise in danger of being influenced overly much by the idiosyncrasies and preferences of a few individuals dominating a national academic system.

Creating a European framework of reference has, however, also an added European value. Very few institutions exist today which offer training on a level which could be clearly identified as Humanities Computer Science by the terms above. There are many attempts, however, to offer Humanities students introductions to computational skills and appropriate background knowledge, bundled in a confusing plethora of degrees, add-on diplomas and occupationally qualifying courses. This has two massive drawbacks:

Within academia, it is almost impossible to implement fair competition for evolving academic positions if there are no terms of reference for the qualifications required. This is particularly serious for positions which are offered on the emerging joint European academic job market.
Even more crucial, however: Virtually all of the courses just described are started with the promise to increase the employability of their students. This promise can only be kept in the future if potential employers have a clear understanding of what skills the people have whom they are supposed to employ.

==================================================================

Information processing

During the roughly forty years for which Humanities Computing has existed, computing equipment has progressed from punched cards to colourful images on the WWW. During that time, the products of Humanities computing have changed along with the technology: from the concordances and cross-tabulations, which had almost the character of archetypes in the first two decades, to interactive multi-media presentations.

Still, much of that is superficial. Whether one uses a printed concordance or a full text data base with a mouse driven interface does not change the significance of word patterns found. We ask for the reader's patience, therefore, when we start by defining what information means in the Humanities, instead of describing the advances in the interfaces used to analyse it.

While significant variation exists between individual disciplines within the Humanities, there is, broadly speaking, one major difference between them as a whole and other fields of study, particularly the hard sciences. That is, that the Humanities in general have very little influence on the creation of the information they process. The strength of a magnetic field is measured directly in units which can be analysed by computational equipment. The style of a painting is a property which can be ascribed by a trained observer with some degree of intersubjective consensus among similarly trained individuals. However, the assumptions going into the assignment of that description are infinitely further removed from any meaningful way to process a resulting keyword than the concept of a continuous field strength is from the way in which floating point numbers are handled.

Systematically, we can speak of three types of information, for which we will use the following terms in this section:

Raw information is derived from an original by a purely mechanical process. Typical examples are digitised sound and images.

Transcribed information is produced by a process which tries to differentiate between such properties of the original as are deemed significant and such which are not; the process of transcription does not intentionally change the content to be derived. A clear example is a transcription of a spoken interview or a hand written source, where the transcriber filters background noise or visual properties of writing. Let it be noted, to prepare a later argument, that the introduction of the concept of "significance" makes that kind of information much more specific for an environment, than the raw one. While a digitised interview will be meaningful for all sorts of language studies as well as for oral historians, the decision to remove from a transcription background noises like laughter, will significantly reduce its usefulness for some, but not all research paradigms.

Coded information, finally, is that, where the content of the original is transferred into another set of symbols: Descriptions of paintings by Iconclass codes come immediately to mind, or statistical data sets containing collapsible numeric codes.

Computing methods are used in the Humanities on all three of these levels. They are also used to transfer information between them. OCR turns printed text from raw information into transcribed information, and computer supported content analysis (though not as popular today as it was some years ago) is more or less a systematised attempt at converting transcribed text into coded information.
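The step from transcribed to coded information can be sketched, under strong simplifying assumptions, as a dictionary based content analysis in Python. The codebook CODEBOOK, the function code_text and the sample sentence are invented for illustration; actual coding schemes are far larger and are tied to a specific research question.

```python
import re
from collections import Counter

# Hypothetical coding scheme: category code -> words taken to signal it.
CODEBOOK = {
    "KINSHIP": {"father", "mother", "son", "daughter", "brother", "sister"},
    "PROPERTY": {"house", "land", "field", "mill"},
}


def code_text(transcription: str) -> Counter:
    """Count, per category code, how many words of the transcription match."""
    # Split the transcribed text into lower-case words (ASCII letters only).
    words = re.findall(r"[a-z]+", transcription.lower())
    counts = Counter()
    for word in words:
        for code, vocabulary in CODEBOOK.items():
            if word in vocabulary:
                counts[code] += 1
    return counts


# Invented sample transcription.
sample = "The father left the house and the mill to his son."
codes = code_text(sample)
print(codes["KINSHIP"], codes["PROPERTY"])  # 2 2
```

The sketch also shows what the conversion discards: everything in the transcription that the codebook does not anticipate is lost in the coded result.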

While we described these categories in increasing distance from the original material on which the analysis is based, historically Humanities Computing has developed in exactly the opposite direction. That is: While in earlier years the emphasis was on the usage of computing to analyse the relationships and dependencies between coded properties of the objects of analysis, we are now moving more and more towards attempts at analysing the raw information the Humanities have to deal with.

How we evaluate the significance of that development depends very much on the general methodological approach a researcher follows. One position is that the methodological quality of a scientific argument is centrally, though not exclusively, influenced by two factors: (a) the ability to explain the largest possible amount of evidence and (b) the intersubjectivity of the chain of argumentation.

While not always explicit, these assumptions have been with us since the earliest days of Humanities Computing. In history, e.g., the major argumentation for the introduction of computer usage has been the ability to use "mass sources", where the information contained in huge numbers of by themselves meaningless individual events could be sensibly integrated into statistical arguments. And much of the opposition against it arose from a discussion, whether statistical argumentation actually increased intersubjectivity, as all the assumptions had to be made explicit, or whether on the contrary it damaged it, as statistical training was now needed to understand the argumentation.

These two methodological assumptions are always a useful starting point for a discussion of the significance of the processing of information within the Humanities. Even more so, as the trend to move more and more from an analysis of coded towards raw information, has taken major steps forward in recent years.

As a case in point, the ability to handle images digitally has many important effects. In line with the arguments given above, leading practitioners of the field in art history are currently moving towards formalisations of concepts like "style" or "colour usage" which are based on a direct analysis of the image material.

In a more general perspective, the arrival of image handling capabilities has changed very general assumptions about the usage of computers in the Humanities. A few years ago, it was obvious, that Computing in the Humanities meant first and foremost the application of computers within research. The explosion of visually attractive presentation tools has changed this quite fundamentally. In many cases, nowadays, the usage of computers in the Humanities seems to be focused more on the didactically well formed presentation of results, than on their generation.

This new emphasis on visuality may in the nearer future have some surprising effects: Notice, e.g., that much of the current discussion of markup schemes started from the fundamental assumption that visual attributes of a text were just arbitrary indications of a conceptual dimension, while their visual representation was irrelevant. It will be interesting to see whether this assumption survives the state of technological development which originally favoured this notion, if it did not introduce it.

In other fields, too, the introduction of image processing had unexpected results. Until very recently the use of digital resources, raw information in our terminology, in the world of images was centred on art history and art historical objects, while manuscripts were rather a side area. At the moment it almost looks as if that were changing: One of the fastest growing sectors in digital resources for the Humanities is currently the digitised collections of books and manuscripts created by libraries and archives.

It is somewhat alarming that these resources are mainly created outside of Humanities research and produced by institutions which traditionally have been focusing on the accessibility of material, not on its production. The more so, because this may be the background of one of the more fundamental changes which the information technologies are currently creating, though the Humanities may not be as aware of it as they should be.

One of the constants for all considerations of how to make sources available in all of the Humanities subjects has always been, that their visual reproduction has been very costly, specifically much more costly than the publication of their transcriptions or descriptions. All the Humanities disciplines have therefore focused on rules on how to select relatively small numbers of sources, which were sufficiently important or canonical, to merit their reproduction by transcriptions or descriptions. These, cheaper than photographic reproductions, were still very expensive: Many Humanities disciplines and sub-disciplines are based, therefore, on a very intensive and detailed discussion of relatively small numbers of canonical texts or corpora.

The tacit assumption behind that strategy no longer holds. It is already clear that the systematic reproduction of huge amounts of source material in digital form is possible today at very low cost, making accessible for discussion, in principle, corpora of sources which are several orders of magnitude larger than before.