Normalizing Data for Migrations

Normalizing Data for Migration Kyle Banerjee [email protected]

Post on 02-Aug-2015

TRANSCRIPT

Page 1: Normalizing Data for Migrations

Normalizing Data for Migration

Kyle Banerjee [email protected]

Page 2: Normalizing Data for Migrations

Migrations are a fact of life

Acquisitions data

Item data

ERM

Bibliographic

Patron data

Statistics

Holdings Information

Content Management Systems

Link resolver

Circulation data

Archival management software

Institutional Repository

Page 3: Normalizing Data for Migrations

You can do a lot without programming skills

Absolutely!

✓ Carriage returns in data

✓ Retain preferred value of multivalued fields

✓ Missing or invalid data

✓ Find problems following complex patterns

Maybe...

? Conditional logic

? Changes based on multifield logic

? Convert free text fields to discrete values

Page 4: Normalizing Data for Migrations
Page 5: Normalizing Data for Migrations

Excel

● Mangles your data
    ○ Barcodes, identifiers, and numeric data at risk

● Cannot fix carriage returns in data

● Crashes with large files

● OpenRefine is a better tool for situations where you think you need Excel http://openrefine.org
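
For readers who do have some scripting available, the two Excel problems above (mangled identifiers and carriage returns inside data) can also be handled outside a spreadsheet. The following is only a sketch under assumptions: the data is taken to be quoted CSV, and the function name flatten_line_breaks, the column names, and the sample barcode are all made up for illustration. Python's csv module reads every value as text, so leading zeros survive.

```python
import csv
import io
import re

def flatten_line_breaks(csv_text: str) -> list[list[str]]:
    """Parse CSV text and replace line breaks inside fields with one space.

    Every value stays a string, so barcodes and other identifiers are
    not reinterpreted as numbers.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [
        [re.sub(r"[\r\n]+", " ", value) for value in row]
        for row in reader
    ]

# Hypothetical sample: a barcode with leading zeros and a note that
# contains an embedded carriage return / line feed.
sample = 'barcode,note\r\n00123456789012,"first line\r\nsecond line"\r\n'

for row in flatten_line_breaks(sample):
    print(row)
```

The quoted note is read as one field even though it spans two physical lines, which is the case a line-oriented tool would get wrong.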

Page 6: Normalizing Data for Migrations

Keys to success

🔑Understand differences between the old and new systems

🔑Manually examine thousands of records

🔑Learn regular expressions

🔑Ask for help!

Page 7: Normalizing Data for Migrations

Watch out for

✓ Creative use of fields
    ○ Inconsistencies and changing policies
    ○ Embedded code
    ○ Data that exploits buggy behavior

✓ Different data structures
    ○ Acq, licensing, electronic, items, etc.

✓ Different types of data within fields (e.g. codes vs. text)

Page 8: Normalizing Data for Migrations

CONTENTdm migration example

● XML metadata export was broken in every field that contained an HTML entity (& < > " ' etc.): the field was split in two at the entity, and the entity lost its semicolon

<dc:subject>Oregon Health &amp</dc:subject>
<dc:subject> Science University</dc:subject>

● Error occurs in many fields scattered across thousands of records

● But this can be fixed in seconds!

Page 9: Normalizing Data for Migrations

Regular expressions to the rescue!

● “Whenever a field ends in an HTML entity minus the semicolon and is followed by an identical field, join those into a single field and fix the entity. Any line can begin with an unknown number of tabs or spaces”

/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

Page 10: Normalizing Data for Migrations

Regular expressions can...

● Use logic, match on capitalization and the edges of words or lines, express ranges, and reuse part (or all) of what you matched in replacements

● Convert free text to XML, delimited text, or codes, and vice versa

● Find complex patterns using proximity indicators and/or involving multiple lines

● Select preferred versions of fields
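
As a small illustration of converting free text to codes, here is a sketch in Python. Nothing in it comes from the slides: the status notes, the one-letter codes, and the names STATUS_RULES and status_code are hypothetical. It shows three of the features listed above: alternation (logic), case-insensitive matching, and word edges.

```python
import re

# Hypothetical rules: each pattern maps a free-text note to a code.
# \b marks a word edge, so "lost" does not match inside "Lostine".
STATUS_RULES = [
    (re.compile(r"\b(lost|missing)\b", re.IGNORECASE), "m"),
    (re.compile(r"\bwithdrawn?\b", re.IGNORECASE), "w"),
    (re.compile(r"\b(repair|bindery)\b", re.IGNORECASE), "r"),
]

def status_code(note: str, default: str = "-") -> str:
    """Return the code of the first rule that matches, else the default."""
    for pattern, code in STATUS_RULES:
        if pattern.search(note):
            return code
    return default

print(status_code("MISSING since 2012"))   # m
print(status_code("Withdrawn per policy")) # w
print(status_code("Lostine, Oregon"))      # -
```

The rules are checked in order, so when a note could match more than one pattern, the order of the list decides which code wins.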

Page 11: Normalizing Data for Migrations

Confusing at first, but easier than you think!

● Works on all platforms and is built into a lot of software

● Ask for help! Programmers can help you with syntax

● Let’s walk through our example which involves matching and joining unknown fields across multiple lines...

Page 12: Normalizing Data for Migrations

Regular Expression Analysis

/^\s*<\([^>]\+>\)\(.*\)\(&[a-z]\+\)<\/\1\n\s*<\1/<\1\2\3;/

^ Beginning of line

\s*< Zero or more whitespace characters followed by “<”

\([^>]\+>\) One or more characters that are not “>” followed by “>” (i.e. a tag). Store in \1

\(.*\) Any characters to next part of pattern. Store in \2

\(&[a-z]\+\) Ampersand followed by letters (HTML entities). Store in \3

<\/\1\n “</” followed by \1 (i.e. the closing tag) followed by a newline

\s*<\1 Any number of whitespace characters followed by tag \1

/<\1\2\3;/ Replace everything up to this point with “<” followed by \1 (opening tag), \2 (field contents), \3, and “;” (fix HTML entity). This effectively joins the fields
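
The same pattern can be tried in Python, which may help when checking each piece of the analysis above. This is a sketch rather than the method used in the migration: the function name join_split_fields and the sample text are assumptions. Python's re module writes groups and + without backslashes, and re.MULTILINE is needed so that ^ matches at the start of every line.

```python
import re

# Group 1 = tag name plus ">", group 2 = field contents,
# group 3 = the entity that is missing its semicolon.
BROKEN_ENTITY = re.compile(
    r"^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1",
    re.MULTILINE,
)

def join_split_fields(xml_text: str) -> str:
    """Join a field that ends in an unterminated entity with the
    identical field on the next line, restoring the semicolon."""
    return BROKEN_ENTITY.sub(r"<\1\2\3;", xml_text)

# Sample modeled on the CONTENTdm example.
sample = (
    "  <dc:subject>Oregon Health &amp</dc:subject>\n"
    "  <dc:subject> Science University</dc:subject>\n"
)

print(join_split_fields(sample))
# <dc:subject>Oregon Health &amp; Science University</dc:subject>
```

As in the original replacement, the leading whitespace is part of the match and is not put back, so the joined line loses its indentation.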

Page 13: Normalizing Data for Migrations

A simpler example

● Find a line that contains 1 to 5 fields in a tab delimited file (because you expect 6)

^\([^\t]*\t\)\{0,4}[^\t]*$

● To automatically join it with the next line with a space

/^\(\([^\t]*\t\)\{0,4}[^\t]*\)\n/\1 /

However, it would be much safer and easier to use syntax that detects the first or last field
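
The short-line check can be sketched in Python as follows. The function names and sample rows are illustrative assumptions, and the join is done line by line rather than with a single substitution, so that a fragment that has just been appended is not itself joined to the line after it. Like the pattern above, it relies on field counts alone, so it carries the same risk the slide warns about.

```python
import re

# 0 to 4 tabs means 1 to 5 fields, i.e. fewer than the 6 expected.
SHORT_LINE = re.compile(r"(?:[^\t]*\t){0,4}[^\t]*")

def find_short_lines(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line number, line) for every line with 1 to 5 fields."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if SHORT_LINE.fullmatch(line)
    ]

def join_short_lines(lines: list[str]) -> list[str]:
    """Join each short line to the line that follows it with a space."""
    joined = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if SHORT_LINE.fullmatch(line) and i + 1 < len(lines):
            line = line + " " + lines[i + 1]
            i += 1  # the next line has been consumed
        joined.append(line)
        i += 1
    return joined

# Hypothetical rows: the second record was split by a line break.
rows = [
    "1\tA\tB\tC\tD\tE",
    "2\tA\tB\tbroken",
    "note\tD\tE",
    "3\tA\tB\tC\tD\tE",
]

print(find_short_lines(rows))
print(join_short_lines(rows))
```

Running find_short_lines first and reviewing the hits by hand is a cautious way to confirm that the short lines really are fragments before joining anything.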

Page 14: Normalizing Data for Migrations

If you want a GUI, use OpenRefine

http://openrefine.org

● Sophisticated, including regular expression support and ability to create columns from external data sources

● Convert between different formats

● Handles up to a couple hundred thousand rows

Page 15: Normalizing Data for Migrations
Page 16: Normalizing Data for Migrations

Normalization is more conceptual than technical

● Every situation is unique and depends on the data you have and the config of the new system

● Don’t fob off data analysis on technical people who don’t understand library data

● It’s not possible to fix everything because the systems work differently (if they didn’t, migrating would be pointless)

Page 17: Normalizing Data for Migrations

Questions?

Kyle Banerjee [email protected]