To work on this assignment, you will need to find and do the following:
<sch:schema>
down. To write Schematron rules for a document in the TEI namespace, you will then replace this with:
<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
        xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
        xmlns="http://purl.oclc.org/dsdl/schematron">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
</schema>
</schema>
root element.
Use the tei: prefix before each of your elements, since we are working with a document in the TEI namespace. Remember that we do not use that prefix before attributes such as @wit, because unprefixed attributes are in no namespace.
The Dickinson project team is using TEI <app> elements inside the lines of Dickinson’s poems when they need to encode a set of variant words or phrases that appear in different publications, labeled in the <rdg> elements with their @wit attributes. You are working with a single file representing a set of poems from a collection of manuscripts, or fascicle, that Emily Dickinson bundled and bound together herself. For this assignment, you will write Schematron to function on top of the established TEI Relax NG schema to help ensure that the <app> and <rdg> elements are written properly according to the rules of the team. You will need to write a few rules to make sure that particular elements and attributes appear where we need them, to make sure the poems appear in the proper order in this document (Poem 1 through Poem 11), and to control for missing or additional white space around our tags that might be distorting our representation of the poems.
<rdg> element has no other attribute but @wit. We want to make sure that the <rdg> element has nothing but an @wit attribute, and that it always has this attribute. (The TEI schema by itself will allow other attributes, or no attributes at all, on this element, but we want to make sure that our project team uses only this @wit attribute and no others.) Consider that we want the Schematron to tell us when @wit is missing, and decide whether you need to write an <assert> or a <report> rule for this. Our solution uses the not() function in @test to fire if the <rdg> does not have @wit (including if it has any attribute other than @wit).
Remember to use the tei: prefix before your element names! (Examples: tei:app and tei:rdg)
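As a reminder of how the two kinds of test differ, here is a minimal sketch written on a different element, so the choice for <rdg> remains yours to make. It assumes, purely for illustration, that every tei:l should carry an @n attribute; the rule and its message wording are hypothetical and are not part of the assignment solution.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <pattern>
        <rule context="tei:l">
            <!-- An assert states what should be true; its message appears when the test is FALSE. -->
            <assert test="@n">Each line of poetry should have an @n attribute.</assert>
            <!-- A report describes a problem to look for; its message appears when the test is TRUE. -->
            <report test="not(@n)">This line of poetry is missing its @n attribute.</report>
        </rule>
    </pattern>
</schema>
```

Both tests here respond to the same condition (a line without @n), so in practice you would keep only one of them. Which one reads more naturally depends on whether you prefer to state the requirement or describe the error.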
<rdg> elements inside an <app> element. For this rule we want to check that an <app> element contains at least one <rdg> element. Note: For every new rule that matches in some way on the <rdg> context, you need to position it inside a new Schematron <pattern> element, because otherwise only the first rule at a given context will fire and the others will remain passive.
First we need to look over our XML document that is holding all of the poems and find out where all of the poem titles are located in the hierarchy. Those poem titles each hold a number (1 to 11) that indicates where they properly sit in the sequence of the collection. We know that these <title> elements are positioned inside the div/head of each poem. Notice that each begins with the same pattern of text: the word "Poem" followed by a white space and a one- or two-digit number.
Write a <rule> element and set its @context accordingly. (Our solution sets the @context at the title position. Think about whether you want to use the following-sibling:: or the following:: axis. Either way, you will need to compare a number in the <title> element of the current poem to that of the first, immediately following poem.)
Write an <assert> or <report> test that isolates the number of the poem inside the title element at your context. Our solution uses the number() function to convert the numeral(s) into a number, and then adds + 1 to test whether that value equals the number of the poem given in the next following poem div (stepping down into its title element to isolate, convert, and read its number). (Think about why we need to add + 1 here, or about alternative ways you could write this test.)
To isolate the number, use the substring() function (which you may wish to look up in Michael Kay or w3schools to see how it is formatted). As we use it here, the substring() function takes three arguments. The first argument indicates the XPath node whose text you want to read (so if you set your rule context at the title, you would just use self::* or the dot (.) as the first argument). The second and third arguments are numbers: the second gives the position, counting from 1, of the character in the whole string of text where you want to start extracting your substring (so for this, count over from the start of the title to the first digit you want). The third argument indicates how many characters you want to extract into your substring. So the function is set up like this:
substring(XPath, character-position-number-to-start, number-of-characters-to-extract)
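To see the character counting in action without giving away the position for the poem titles, here is a small sketch that runs substring() on a made-up literal string, 'Fascicle 16', rather than on anything in the Dickinson file. The string, the rule context, and the message are all assumptions for illustration only.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <!-- The context is just the root element, so the test runs once. -->
        <rule context="/*">
            <!-- Counting from 1, the '1' in 'Fascicle 16' is character 10,
                 so substring('Fascicle 16', 10, 2) returns the string '16'. -->
            <!-- number() converts that string to the number 16, and adding 1 gives 17. -->
            <report test="number(substring('Fascicle 16', 10, 2)) + 1 = 17">
                The substring was extracted and converted to a number.
            </report>
        </rule>
    </pattern>
</schema>
```

In your own rule the first argument would be a node (such as the dot) instead of a literal string, and the start position would depend on where the digits sit in the title text.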
Note: since we have 11 poems, we are going to need to extract two characters to deal with Poem 10 and Poem 11.
Wrap the substring() in a number() function to convert it, and now work with it as a number. Add + 1 to it, and see if that value, (number(substring()) + 1), equals the number from the substring() in the title of just the very next poem in the sequence.
Test the @wit attributes sitting on the <rdg> elements to be sure they are not mistyped. This is something you are likely to need in your projects, so we direct you to our special Schematron tutorial on testing unique identifiers, which shows you how to work with @xml:id attributes (unique identifiers) and their corresponding referencing attributes.
Can you adapt the code in our tutorial to work with this file and its positioning of the list of witnesses in this document?
<app> and <rdg> elements in a line of poetry. As the team works on coding these poems, it is very easy for them to accidentally remove or add white space when applying <app> and <rdg> elements. For example, it is easy to make two words run together by coding like this:
<l n="1">When we stand on the tops of<app> <rdg wit="#df16">Things—</rdg> <rdg wit="#bm">things</rdg> </app> </l>
Notice that there is no space before the opening <app> tag and no space just inside the opening <rdg> tag, so when the team transformed this to view the first witness in HTML, we saw something like this:
When we stand on the tops ofThings—
<app> element, and sometimes we do not.
<app> ends with white space already, when it has a special punctuation mark, a dash (—) or a quotation mark (") designed to connect with the text in the <rdg> element(s).
<app> ends with something other than the three characters we described above and the <rdg> element inside begins with a letter (another alphabet character).
<app> ends with any non-space character followed by a space, and the <rdg> element opens with a white space.
parent::app is followed by a string of text.
We work from the @context of the tei:rdg element, because we need to look at each <rdg> element in turn to see if we have a white space problem, and there are often multiple <rdg> elements inside each line. When we set the context to the whole line of poetry, it might have multiple sets of <app> elements inside, and we cannot write a precise enough rule to address the spans of text we need.
To proceed, we need to understand something about mixed content: When an element like the TEI l (for a line of poetry) contains a mixture of text() and other elements, the text() node is sitting in a sibling relationship to the nested elements, so that a span of text in a line of a Dickinson poem is sitting on the preceding-sibling:: axis in relation to the <app> element that follows it. If you write your rule as we did, from the context of the <rdg> elements, you will need to write your test to reach up to the parent <app> element and walk over to the preceding-sibling::text() node.
We use the matches() function in our Schematron @test because it works with regular expressions and helps us to identify the particular conditions we are looking for. (Look up this function in one of the sources we list in the Preliminaries section of this assignment to be sure you understand how to write it.) Specifically, we are going to need a two-part test, and we can use the matches() function twice, joined by the word and, to see whether a) the text that is the first preceding-sibling of our parent <app> ends with something in a regex character set, and b) the contents of our context <rdg> element start with something in a regex character set, like this:
test="matches(. . .) and matches(. . .)"
or
test="not(matches(. . .)) and matches(. . .)"
or some combination of these.
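Before writing the two-part test, it can help to confirm that your XPath really lands on the text you expect. The following is a diagnostic sketch only, not the solution: it assumes the rule context is tei:rdg as described above, and the variable name before and the message layout are hypothetical. It reports the text node found just before the parent <app> together with the first character of the <rdg>, with brackets around each so that stray spaces are visible.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <pattern>
        <rule context="tei:rdg">
            <!-- Hypothetical variable: the nearest text node before the parent app. -->
            <let name="before" value="parent::tei:app/preceding-sibling::text()[1]"/>
            <!-- Diagnostic only: fires wherever such a text node exists. -->
            <report test="exists($before)">
                Text before app: [<value-of select="$before"/>]
                First character of this rdg: [<value-of select="substring(., 1, 1)"/>]
            </report>
        </rule>
    </pattern>
</schema>
```

One assumption to keep in mind: text()[1] on the preceding-sibling:: axis picks the nearest preceding text node even if another element sits between it and the <app>, so a line with two adjacent <app> elements may need a more careful expression. Once the messages show the strings you expect, replace the test with your matches() conditions.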
Note that the matches() function takes two arguments, like this: matches(XPath-location, 'regex-pattern'). You might be wondering why we aren’t using the functions starts-with() or ends-with(). The answer is that these compare literal strings and do not help us with regular expressions, but matches() can look for a regex pattern wherever we need it. To designate the start of a line in regex (or the start of the text in a given XPath node), use the regex caret, ^, at the start of the regex pattern you are hunting for, and to designate the end of the text, use the regex dollar sign, $, at the end of your regex pattern.
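Here is a minimal sketch of the two anchors at work, written against whole lines of poetry rather than against <rdg>, so it illustrates the syntax without solving the white space task. The choice of tei:l as context, the character sets, and the messages are assumptions for illustration.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <pattern>
        <rule context="tei:l">
            <!-- ^ anchors the pattern to the start of the text:
                 true when the line's text begins with a lowercase letter. -->
            <report test="matches(., '^[a-z]')">
                This line begins with a lowercase letter.
            </report>
            <!-- $ anchors the pattern to the end of the text:
                 true when the line's text ends with a white space character. -->
            <report test="matches(., '\s$')">
                This line ends with white space.
            </report>
        </rule>
    </pattern>
</schema>
```

Without the anchors, matches(., '[a-z]') would be true for a lowercase letter anywhere in the line, which is why the caret and dollar sign matter for this task.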
Bonus task: You will likely have difficulty with matching on a quotation mark, because if you try to include it literally in the character set (or even escape it with a backslash), it will be interpreted as the end of the Schematron attribute and will result in a well-formedness error, munging your Schematron code. Consider it a bonus task on this assignment to find a way to match on a straight quotation mark. Hint: you will need to escape the literal quotation mark using &quot;, but you won’t be able to include it in a [ ] character set.
See how far you can get with this Optional Challenge Task, and if you get stuck, record what you tried and what didn’t work. Do your tests fire? You should see some white space errors in the file as we presented it, but you should also tinker with the white space just before an <app> tag and at the start of an <rdg> element.
Upload your completed Schematron schema AND the Dickinson poems XML with your Schematron associated to Courseweb, and follow our standard filenaming conventions for homework assignments uploaded to Courseweb.