Consult the following resources as you work with Regular Expressions:
We downloaded some data about movies from the 1930s to 2018 in a spreadsheet and saved it
as a plain-text file, which you can download from our site here: movieData.txt. (Spreadsheets can be saved as raw
text in tab-separated or
comma-separated format. We saved ours with tabs as separators, because some of the movie
titles contain commas.) Working with text saved with tab-separated values can be a good way to orient yourself to regular expression patterns.
Our goal is to apply regular expressions to convert this text document (with thousands of lines) into an XML document without coding by hand!
(The file has a `.txt` extension, and you might rename this as `YourName_MovieData.txt`.)
Step File: a plain text (`*.txt`) or markdown (`*.md`) file, and not something you write in a word processor (not a Microsoft Word document), so you do not have to struggle with autocorrections of the regex patterns you are recording. To open a new markdown file, type `markdown` in the search bar. (Or type in `text` to open a new plain text file.) On Windows, you can find and open Notepad and record your steps in plain text form there, outside of oXygen. This may be convenient, so you don’t accidentally try your find-and-replace operations on your step file instead of the main text. On Mac, you might try TextEdit, or stick with <oXygen/> and open your window in Tile View as we did with your Relax NG Schema files.
Your goal is to produce an XML version of the movie data file by using the
search-and-replace techniques we discussed in class, and record each
step you take in a plain text or markdown file so others can reproduce exactly what you
did. (You may, in a real-life project situation, need to share the
steps you take in up-converting plain text documents to XML on your
GitHub repo in GitHub’s markdown (the same that we write on the GitHub Issues
board), and in that case you would save the file with a `.md` extension.)
Your up-converted XML output should look something like movieData.xml. This involves putting each movie’s title, date, location, and running time in its own element and reformatting the running time to hold its unit in an attribute. It also involves wrapping the data for each movie in an element of its own.
Your Steps file needs to be detailed enough to indicate each step of your process:
what regular expression patterns you attempted to find, and what expressions you used to
replace them. You might record the number of finds you get and even how you fine-tuned your
steps when you were not finding everything you wanted to at first.
Note: we strongly recommend copying and pasting your find
and replace expressions into your Steps file instead of retyping them
(since it is easy to introduce errors that way).
There are several ways to get to the target output, but the starting points are standard:
First of all, for any up-conversion of plain text, you must check for the special reserved
characters: the ampersand `&` and the angle brackets `<` and `>`. You need to search for those and, if they turn up, replace them
with their corresponding XML entities, so that these will not interfere with well-formed
XML markup.
Search for: | Replace with: |
---|---|
`&` | `&amp;` |
`<` | `&lt;` |
`>` | `&gt;` |
Note that you need to process the special XML reserved characters in the correct order.
Why is it important that you search and replace the `&` first?
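One way to explore the ordering question is with a small Python sketch. Python is used here only for illustration and is not part of the oXygen workflow; the function names and the sample string are made up. The sketch applies the three replacements from the table in both orders so the results can be compared.

```python
def escape_ampersand_first(text: str) -> str:
    """Escape the XML reserved characters, handling & before < and >."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_ampersand_last(text: str) -> str:
    """Apply the same replacements in the other order, for comparison."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # The & inside the entities written above is matched again by this step.
    text = text.replace("&", "&amp;")
    return text


sample = "Fast & Furious <2009>"  # invented sample text, not from movieData.txt
print(escape_ampersand_first(sample))
print(escape_ampersand_last(sample))
```

Comparing the two printed lines shows what happens to the entities for the angle brackets when the ampersand is handled last.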
To perform regex searching, you need to check the box labeled Regular expression
at the bottom of the <oXygen/> find-and-replace dialog box, which you open with
Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/>
will just search for what you type literally, and it won’t recognize that some
characters in regex have special meaning. You don’t have to check anything else yet. Be
sure that Dot matches all is unchecked, though; we’ll explain why below.
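The effect of a dot-matches-all setting can be illustrated outside of oXygen. The sketch below assumes that Python's `re.DOTALL` flag plays a role comparable to the Dot matches all checkbox; the second sample line is invented for illustration and is not taken from movieData.txt.

```python
import re

# Two sample lines of tab-separated movie data joined by a line break.
text = "Operation Dunkirk\t2017\tUSA\t96 min\nCasablanca\t1942\tUSA\t102 min"

# By default the dot stops at a line break, so each line is a separate match.
per_line = re.findall(r".+", text)

# With DOTALL the dot also matches the line break, so a single match
# swallows both lines at once.
whole_text = re.findall(r".+", text, flags=re.DOTALL)

print(len(per_line))    # number of matches without DOTALL
print(len(whole_text))  # number of matches with DOTALL
```

When the goal is to treat each line as its own unit, a dot that runs across line breaks works against you.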
Our data is organized in lines of text, so we recommend starting by wrapping those lines
in a simple wrapper element (`<movie>....</movie>`) to
isolate each line of data about each movie. We can then proceed to fine-tune the markup
and add more inside each movie element, working around the tab characters.
From each text row of movie data, we ultimately want to create this pretty-printed, structured XML markup (showing a sample of the data for the movie Operation Dunkirk):

    <movie>
        <title>Operation Dunkirk</title>
        <date>2017</date>
        <location>USA</location>
        <time unit="min">96</time>
    </movie>

To get to this point, start by looking at the lines in the text file as you have it open
in oXygen. You'll see that each line is numbered. We can try working on this data from
the outside in, that is, wrap each whole line in a wrapper element so each
movie’s data is contained in tags:
Operation Dunkirk 2017 USA 96 min
Use a Find and Replace operation to isolate each line with a simple regular expression. Then, in your replace, refer to the Find expression to capture it, either as a whole unit, or as a capturing group.
<movie>Operation Dunkirk 2017 USA 96 min</movie>
The way regex really thinks of this process is: match every movie line, delete it, and
replace it with remixed pieces of itself wrapped in `<movie>` tags. That is, regex doesn’t think about leaving the movie line in place and
inserting something before and after it; it thinks about matching each movie line,
deleting it, and then putting the whole thing back, with the tags that you desire. You
need to refer to what you want to keep (in this case the whole thing) as a
capturing group. When we want to keep the whole expression that we found,
the whole line of text here, we refer to capturing group 0 with `\0`.
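The match, delete, and put back idea can be sketched in Python as a rough analogue of the oXygen find-and-replace. This is only an illustration under stated assumptions: the helper name `wrap_lines` is hypothetical, the pattern `.+` is one possible way to match a line, and Python writes the whole match as `\g<0>` in the replacement rather than `\0`.

```python
import re


def wrap_lines(text: str) -> str:
    """Wrap every non-empty line in <movie> start and end tags."""
    # .+ matches one whole line, because the dot does not cross line
    # breaks by default. \g<0> puts the entire match back in place.
    return re.sub(r".+", r"<movie>\g<0></movie>", text)


sample = "Operation Dunkirk\t2017\tUSA\t96 min"
print(wrap_lines(sample))
```

Each matched line is replaced by itself plus the surrounding tags, which is why nothing inside the line is lost.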
Once you have isolated the movie lines and wrapped them in start and end tags, it is time to apply more detailed markup inside, to isolate each movie title, date, and location. We will do something special with the time unit, remixing that data to put the unit inside an attribute value. To do this work, you will need to learn how to mark and apply capturing groups.
To make capturing groups you set parentheses around the portions of your
regular expression that you want to keep. Think of setting capturing groups as a way to
isolate pieces of your Find so that you can point to them and position them
exactly where you want in your Replace. Take your first step by locating the
`<movie>` start tag that you just set down followed by just the
movie title (which is bordered by a tab character). Once you can find these things, wrap
the element tag in its own capturing group, and then the title information in a second
capturing group.
In the replace, you will need to refer to the capturing groups using a special regular
expression. The sequence `\1` points to the first capturing group, ordered
from left to right. `\2` refers to the second capturing group. Remember, the
expression `\0` refers to the entire match regardless of the capturing
groups. Try experimenting with Find and Replace using capturing groups in various ways
until you set down the tagging you want. (The Undo button in oXygen is under the Edit
menu, and we use it frequently when we are experimenting like this!)
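As a hedged illustration of capturing groups, here is a Python sketch that works on a single wrapped line. The patterns are just one possibility among several and are not the expected solution. The sketch assumes the fields are separated by tabs and that a space separates the running time from its unit, which may not match the real file.

```python
import re

line = "<movie>Operation Dunkirk\t2017\tUSA\t96 min</movie>"

# Group 1 is the start tag; group 2 is everything up to the first tab,
# which is the title. The tab itself is matched but not put back.
with_title = re.sub(r"(<movie>)([^\t]+)\t", r"\1<title>\2</title>", line)

# Group 1 is the number, group 2 the unit, group 3 the end tag. The
# replacement swaps their order so the unit lands in an attribute.
with_time = re.sub(
    r"(\d+) (\w+)(</movie>)",
    r'<time unit="\2">\1</time>\3',
    with_title,
)

print(with_title)
print(with_time)
```

The groups are numbered by their opening parentheses from left to right, so reordering them in the replacement is what makes the remix possible.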
We are not going to tell you how to create your regular expressions: part of the learning
process here is looking stuff up in the tutorial sites we have provided, and
asking for help when you get stuck on our DIGIT-Coders Slack or by opening an issue on
our textAnalysis-Hub. Do your best to wrap the data you see in meaningful tags, even if
what you create does not look exactly like our sample XML.
Save your text file now as an XML file by saving as `.xml`. You will now need
to reopen the document so that oXygen actually recognizes and reads the file as an XML
document and can tell you whether it is well-formed. It probably is not
well-formed, because you need to wrap the document in a root element. Do
that and inspect the document for well-formedness. To check for well-formedness in the
XML file, you can use Control+Shift+W on Windows, Command+Shift+W on Mac, or click on
the arrow next to the red check mark in the icon bar at the top and choose Check
well-formedness. If you see regular patterns of something that you can fix with
regular expressions, use them and document your steps.
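The role of the root element can also be checked outside of oXygen. The sketch below uses Python's standard `xml.etree.ElementTree` parser as a stand-in for the oXygen well-formedness check; the helper name `is_well_formed`, the root element name `<movies>`, and the second sample title are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Two sibling <movie> elements with no root element around them.
movies = (
    "<movie><title>Operation Dunkirk</title></movie>\n"
    "<movie><title>Casablanca</title></movie>"
)

print(is_well_formed(movies))
print(is_well_formed("<movies>\n" + movies + "\n</movies>"))
```

An XML document may have only one outermost element, so the parser rejects the first string and accepts the wrapped one.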
As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, and the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is more comprehensive tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.
Upload your Steps file, saved as a markdown (`.md`) or plain text (`.txt`) document (a step-by-step description of what you did), and your up-converted XML file (`.xml`).
If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As you are working on this, post any questions on Slack or our class GitHub Issues board!