normalizing data for migrations

Normalizing Data for Migration

Kyle [email protected]

Migrations are a fact of life

● Acquisitions data

● Item data

● ERM

● Bibliographic

● Patron data

● Statistics

● Holdings Information

● Content Management Systems

● Link resolver

● Circulation data

● Archival management software

● Institutional Repository

You can do a lot without programming skills

Absolutely!

✓ Carriage returns in data

✓ Retain preferred value of multivalued fields

✓ Missing or invalid data

✓ Find problems following complex patterns

Maybe...

？ Conditional logic

？ Changes based on multifield logic

？ Convert free text fields to discrete values

Excel

● Mangles your data

  ○ Barcodes, identifiers, and numeric data at risk

● Cannot fix carriage returns in data

● Crashes with large files

● OpenRefine is a better tool for situations where you think you need Excel http://openrefine.org
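
One way to inspect identifiers without a spreadsheet guessing at data types is to read the export as plain text. The following is a minimal sketch using Python's standard `csv` module; the column names and sample values are made up for illustration and do not come from any real system.

```python
import csv
import io

# Hypothetical tab-delimited export: a barcode with leading zeros and an
# identifier that a spreadsheet might reinterpret as scientific notation.
sample = "barcode\titem_id\n00031234567890\t3E10\n"

# csv.reader returns every value as a string, so nothing is converted
# to a number or reformatted along the way.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

for row in rows[1:]:
    print(row)
```

Because every field stays a string, leading zeros and letter/digit identifiers come through exactly as they were exported.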


Keys to success

🔑Understand differences between the old and new systems

🔑Manually examine thousands of records

🔑Learn regular expressions

🔑Ask for help!

Watch out for

✓ Creative use of fields

  ○ Inconsistencies and changing policies

  ○ Embedded code

  ○ Data that exploits buggy behavior

✓ Different data structures

  ○ Acq, licensing, electronic, items, etc.

✓ Different types of data within fields (e.g. codes vs. text)

CONTENTdm migration example

● XML metadata export contained errors on every field that contained an HTML entity (& < > " ' etc)

<dc:subject>Oregon Health &amp</dc:subject>
<dc:subject> Science University</dc:subject>

● Error occurs in many fields scattered across thousands of records

● But this can be fixed in seconds!

Regular expressions to the rescue!

● “Whenever a field ends in an HTML entity minus the semicolon and is followed by an identical field, join those into a single field and fix the entity. Any line can begin with an unknown number of tabs or spaces”

/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

Regular expressions can...

● Use logic, capitalization, and the edges of words or lines; express ranges; reuse bits (or all) of what you matched in replacements

● Convert free text to XML, delimited text, or codes, and vice versa

● Find complex patterns using proximity indicators and/or involving multiple lines

● Select preferred versions of fields
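
As a rough illustration of turning free text into codes, here is a minimal Python sketch that uses case-insensitive matching and word edges. The note wording, the code values, and the `status_code` helper are all hypothetical; a real mapping depends on your data and on what the new system accepts.

```python
import re

# Hypothetical rules: each pattern maps a free-text note to a discrete code.
# \b marks the edge of a word; re.IGNORECASE ignores capitalization.
STATUS_RULES = [
    (re.compile(r"\b(lost|missing)\b", re.IGNORECASE), "LOST"),
    (re.compile(r"\bwithdrawn\b", re.IGNORECASE), "WITHDRAWN"),
    (re.compile(r"\bdamaged\b", re.IGNORECASE), "DAMAGED"),
]

def status_code(note, default="REVIEW"):
    """Return the code for the first rule that matches, else a review flag."""
    for pattern, code in STATUS_RULES:
        if pattern.search(note):
            return code
    return default

for note in ["Reported Missing 2014", "withdrawn per policy", "On shelf"]:
    print(note, "->", status_code(note))
```

Anything that matches no rule is flagged for manual review rather than guessed at.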

Confusing at first, but easier than you think!

● Works on all platforms and is built into a lot of software

● Ask for help! Programmers can help you with syntax

● Let’s walk through our example which involves matching and joining unknown fields across multiple lines...

Regular Expression Analysis

/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

^ Beginning of line

\s*< Zero or more whitespace characters followed by “<”

\([^>]\+>\) One or more characters that are not “>” followed by “>” (i.e. a tag). Store in \1

\(.*\) Any characters to next part of pattern. Store in \2

\(&[a-z]\+\) Ampersand followed by letters (HTML entities). Store in \3

<\/\1\n “</” followed by \1 (i.e. the closing tag), followed by a newline

\s*<\1 Any number of whitespace characters followed by “<” and \1 (the identical opening tag)

/<\1\2\3;/ Replace everything up to this point with “<” followed by \1 (opening tag), \2 (field contents), \3, and “;” (fix HTML entity). This effectively joins the fields
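
For anyone who wants to try this outside a text editor, here is a minimal sketch of the same idea in Python. It assumes the slide's pattern is written in sed/Vim-style syntax, so the groups `\( \)` become `( )` and `\+` becomes `+`. The function name `join_split_fields` is made up for this example.

```python
import re

# Python translation of the pattern analyzed above (an assumption, see lead-in).
# [ \t]* is used instead of \s* so the match cannot swallow preceding blank lines.
ENTITY_SPLIT = re.compile(
    r"^[ \t]*<([^>]+>)"  # indentation, "<", then tag name plus ">" -> group 1
    r"(.*)"              # field contents -> group 2
    r"(&[a-z]+)"         # entity that lost its semicolon -> group 3
    r"</\1\n"            # closing tag of the same field, then a newline
    r"[ \t]*<\1",        # the identical opening tag on the next line
    re.MULTILINE,
)

def join_split_fields(xml_text):
    """Join a field that was split at an HTML entity and restore the ';'."""
    return ENTITY_SPLIT.sub(r"<\1\2\3;", xml_text)

sample = (
    "  <dc:subject>Oregon Health &amp</dc:subject>\n"
    "  <dc:subject> Science University</dc:subject>\n"
)
print(join_split_fields(sample))
```

As in the original replacement, leading indentation on the joined line is dropped. A field broken at more than one entity would need the substitution run again, since each pass joins only one pair of lines per field.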

A simpler example

● Find a line that contains 1 to 5 fields in a tab delimited file (because you expect 6)

^\([^\t]*\t\)\{0,4}[^\t]*$

● To automatically join it with the next line with a space

/^\(\([^\t]*\t\)\{0,4}[^\t]*\)\n/\1 /

However, it would be much safer and easier to use syntax that detects the first or last field
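
The short-line check can also be sketched in Python. This is a hedged illustration, not the method from the slides: detection uses a Python form of the pattern, but joining is done by counting fields line by line instead of with a single substitution, because the second half of a broken record is usually short as well and a blind join would glue it to the next good record. The helper names and sample data are made up.

```python
import re

EXPECTED_FIELDS = 6  # from the slide: each record should have 6 fields

# 0 to 4 "field + tab" groups followed by a final field = 1 to 5 fields.
SHORT_LINE = re.compile(r"(?:[^\t]*\t){0,4}[^\t]*")

def find_short_lines(lines):
    """Return 1-based numbers of lines that hold only 1 to 5 fields."""
    return [n for n, line in enumerate(lines, start=1)
            if SHORT_LINE.fullmatch(line)]

def join_broken_records(lines):
    """Glue lines together with a space until a record has enough fields."""
    records, pending = [], None
    for line in lines:
        pending = line if pending is None else pending + " " + line
        if pending.count("\t") + 1 >= EXPECTED_FIELDS:
            records.append(pending)
            pending = None
    if pending is not None:
        records.append(pending)  # leftover fragment, keep for manual review
    return records

text = "1\tA\tB\tC\tD\tE\n2\tA\tB has a\nbreak\tC\tD\tE\n3\tA\tB\tC\tD\tE\n"
lines = text.splitlines()
print(find_short_lines(lines))
print(join_broken_records(lines))
```

Here lines 2 and 3 are both reported as short, and the join step merges them into one six-field record while leaving the records around them alone.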

If you want a GUI, use OpenRefine

http://openrefine.org

● Sophisticated, including regular expression support and ability to create columns from external data sources

● Convert between different formats

● Handles up to a couple hundred thousand rows

Normalization is more conceptual than technical

● Every situation is unique and depends on the data you have and the config of the new system

● Don’t fob off data analysis on technical people who don’t understand library data

● It’s not possible to fix everything because the systems work differently (if they didn’t, migrating would be pointless)

Questions?

Kyle [email protected]
