Lesson 19 - cs.bgu.ac.ildhcs172/wiki.files/class19.bgu.dh.pdf


Post on 22-Jul-2020


Lesson 19: Conversion of primary resources

http://digitalhumanities.org:3030/companion/view?docId=blackwell/9781405103213/9781405103213.xml&chunk.id=ss1-5-2&toc.depth=1&toc.id=ss1-5-2&brand=9781405103213_brand

Primary materials (primary sources)

• "Anything that can be photographed can be digitized"
• Different sources can be photographed / digitized more faithfully
• The question is what "close to the truth" means

Types of materials significant to the humanities

• Documents (printed materials, manuscripts)
• Different periods
• Different languages
• Varied materials: wood, stone, iron tablets, papyrus, parchment, paper
• Scripts (characters, alphabets)
• Non-uniform content (bundle)
• Variable size (no uniformity; different equipment is needed)
• Handwriting: legible or not, options for OCR (though results today are already quite decent!)
• Printed material, 500 years of it: books, journals, newspapers, posters, letters, musical scores, ephemera...

Visual materials

• Paper, canvas, glass, film, fabrics and the like
• Illustrations in manuscripts
• Paintings, illustrations, ...
• Photographs: negatives, glass plates, slides, prints
• Maps, technical drawings, plans

Three-dimensional materials

• Use of models only
• Museum objects, sculpture, buildings, archaeological finds*
• *Leore Grosman's lab in Jerusalem
• "Paperless archaeology", Matthew Adams
• https://www.slideshare.net/evaminerva/02a-matthew-adamskeynotepaperless

"Time-based" materials

• Films, video and sound
• Quite a lot of it is "born digital"
• Open questions: how do you digitize a dance?

The Nature of Digital Data

• Resolution (dpi, ppi)
  • Screen: 75-150 pixels per inch
  • Manuscripts: 300-600 dpi
  • 35 mm slides or microfilm originals will need to be captured at much higher resolutions, and scanners are now available offering resolutions of up to 4,000 dpi for such materials.
• Creating an electronic photocopy of a plain page of text is not a complex technical process with modern equipment, but being able to then automatically recognize all the alphanumeric characters it contains, plus the structural layout and metadata elements, is a highly sophisticated operation.

What influences approaches to digitization

• The type of materials
• The reasons for digitizing
• The technical equipment and budget available to the project
• The possible uses and the potential users
• The long-term survivability of the materials, as well as immediate project needs

The Advantages and Disadvantages of Digital Conversion

• Disadvantages:
  • digital data is evanescent and mutable
  • it can disappear in a flash if a hard drive crashes or a CD is corrupt
  • it can be changed without trace
• Advantages:
  • enables a much wider potential audience
  • gives a renewed means of viewing our cultural heritage, provided the project is well thought out and well managed
  • and this applies however large or small the project might be.
• The advantages of digitization for humanists include:

• the ability to republish out-of-print materials
• rapid access to materials held remotely
• potential to display materials which are in inaccessible formats, for instance, large volumes or maps
• "virtual reunification": allowing dispersed collections to be brought together
• the ability to enhance digital images in terms of size, sharpness, color contrast, noise reduction, etc.
• the potential for integration into teaching materials
• enhanced searchability, including full text
• integration of different media (images, sounds, video etc.)
• the potential for presenting a critical mass of materials for analysis or comparison.

• Need to assess the actual and potential user base.
• What will change when materials are made available in digital form?
• Fragile originals which are kept under very restricted access conditions may have huge appeal to a wide audience when made available in a form which does not damage the originals.

Questions:

• What is it that the digitization is aiming to capture?
• Is the aim to produce a full facsimile of the original that when printed out could stand in for the original?
  • For example, Kafka's notebooks in the National Library
• Some projects have started with that aim, and then found that a huge ancillary benefit was gained by also having the digital file for online access and manipulation.

• There is considerable criticism of the digitization of the Israel State Archives
• The person responsible for digital records at the Swiss Federal Archives wrote to me:

Other than the obvious merits of accessibility (global 24/7 online access, much improved search capability through OCR/transcription), I’d argue that greatest benefit is that the digital format unlocks the information kept within the analog format by enabling the application of software tools and algorithms specifically made for the parsing and analysis of digital information. This mightily increases the potential for gaining knowledge from the information, and enables linking of information across institutions and research facilities (see potential of linked data, open government data platforms and IIIF International Image Interoperability Framework for example) and furthers collaboration. Archives are in themselves worthless until the information is used.


•9%7D%9A%7D%95%7DB%9%7Dq=%search/?il/gov.archives.www.http://alltype=_searchD&9%7D%9

• http://www.archives.gov.il/archives/#/Archive/0b0717068031aedc/File/0b071706804b38cc

• The Digital Image Archive of Medieval Music (DIAMM) project
• Goal: capture of a specific corpus of fifteenth-century British polyphony fragments for printed facsimile publication in volumes such as the Early English Church Music (EECM) series.
• Early studies showed that there was much to be gained from obtaining high-resolution digital images in preference to slides or prints.
• This was not only because of the evidence for growing exploitation of digital resources at that time (1997), but also because many of these fragments were badly damaged and digital restoration offered opportunities not possible with conventional photography.
• The project decided to capture manuscript images in the best quality possible using high-end digital imaging equipment; to archive the images in an uncompressed form; to enhance and reprocess the images in order to wring every possible piece of information from them; to preserve all these images, archive and derivative, for the long term.
• That has proved to be an excellent strategy for the project, especially as the image enhancement techniques have revealed hitherto unknown pieces of music on fragments that had been scraped down and overwritten with text: digitization has not just enhanced existing humanities sources, it has allowed the discovery of new ones (see <http://www.diamm.ac.uk>).

Other reasons for, and forms of, digitization

• Capture the content of a source without necessarily capturing its form.

• A "digital edition" of the work of a literary author might be rekeyed and re-edited in electronic form without particular reference to the visual characteristics of an existing print or manuscript version.

• To add searchability to a written source while preserving the visual form, text might be converted to electronic form and then attached to the image.

What do we want to preserve?

• What level of information is needed?
  • The intellectual content
  • The physical detail of brushstrokes, canvas grain
  • The pores of the skin of the animal used to make the parchment
• Is some kind of analysis or reconstruction the aim?

3D imaging

• http://www.ynet.co.il/articles/0,7340,L-4975905,00.html

On pottery sherds that were discovered at Tel Arad back in the 1960s, and were until now considered free of writing, logistical instructions to military units dating from 586 BCE have been found. The discovery was made possible by an innovative imaging technology.

A mask from the Neolithic period

http://www.yissum.co.il/technologies/project/37-2016-4384

• Prof. Leore Grosman

http://archaeology.huji.ac.il/depart/prehistoric/leoreg/photo.asp#1

Methods of Digital Capture

When possible: optical character recognition (OCR) methods can give a relatively accurate result. Accuracy can then be improved using a variety of automated and manual methods: passing the text through spellcheckers with specialist dictionaries and thesauri, and manual proofing.
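The spellchecker-with-specialist-dictionary step can be sketched roughly as follows. This is a minimal illustration, not the method of any particular OCR package: the word list `LEXICON`, the similarity cutoff, and the function names are all assumptions made for the example, and the matching relies on Python's standard `difflib` module.

```python
import difflib
import re

# Hypothetical specialist word list; a real project would load a far larger one.
LEXICON = ["manuscript", "parchment", "polyphony", "facsimile", "fragment"]


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry for a token, or the token unchanged."""
    lower = token.lower()
    if lower in lexicon:
        return token
    matches = difflib.get_close_matches(lower, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct every alphabetic token; punctuation and spacing are left alone."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_token(m.group(0)), text)


# "rn" misread for "m" is a typical OCR confusion.
print(correct_text("a rnanuscript on parchrnent"))
```

An automatic pass like this can also "correct" a legitimate word that merely resembles a lexicon entry, which is one reason the manual proofing step is still listed alongside it.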

European Union-funded METAe Project (the Metadata Engine Project) -- developing automatic processes for the recognition of complex textual structures, including text divisions such as chapters, sub-chapters, page numbers, headlines, footnotes, graphs, caption lines etc.

Optical capture

- OCR techniques were originally developed to provide texts for the blind.
- OCR engines can operate on a wide range of character sets and fonts, though they have problems with non-alphabetic character sets because of the large number of symbols, and also with cursive scripts such as Arabic.
- Software can be "trained" on new texts and unfamiliar characters so that accuracy improves over time and across larger volumes of data.

OCR can give excellent results if (a) the originals are modern and in good condition and (b) there is good quality control.

Human time is always the most costly part of any operation, and it can prove to be more time-consuming and costly to correct OCR than to rekey the text. It is worth bearing in mind that what seem like accurate results (between 95 and 99 percent, for instance) would mean that there would be between 1 and 5 incorrect characters per 100 characters.

Assuming there are on average 5 characters per word, a 1 percent character error rate equates to a word error rate of 1 in 20 or higher.
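The arithmetic behind the "1 in 20" figure can be checked with a short sketch. It assumes, as a simplification not stated on the slide, that character errors are independent of each other; the function name is illustrative.

```python
def word_error_rate(char_accuracy, chars_per_word=5):
    """Chance that a word holds at least one wrong character,
    assuming independent character errors (a simplifying assumption)."""
    return 1 - char_accuracy ** chars_per_word


for accuracy in (0.99, 0.95):
    wer = word_error_rate(accuracy)
    print(f"{accuracy:.0%} character accuracy: "
          f"about 1 word in {1 / wer:.1f} contains an error")
```

At 99 percent character accuracy this gives roughly one faulty word in twenty; at 95 percent it is closer to one word in four or five, which is why seemingly high character accuracy can still mean heavy correction work.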

OCR with fuzzy matching

• For some projects and purposes, accurate text to the highest level attainable is essential, and worth what it can cost in terms of time and financial outlay.
• For other purposes, speed of capture and volume are more important than quality.
• What needs to be taken into account is the reason a text is to be captured digitally and made available.
• If the text has important structural features which need to be encoded in the digital version, and these cannot be captured automatically, or if a definitive edition is to be produced in either print or electronic form from the captured text, then high levels of accuracy are paramount.
• If, however, retrieval of the information contained within large volumes of text is the desired result, then it may be possible to work with the raw OCR output from scanners, without post-processing.
• A number of text retrieval products are now available which allow searches to be performed on inaccurate text using "fuzzy matching" techniques.
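One common way to implement fuzzy matching is edit distance: a query matches any word that can be reached with a small number of single-character changes. The sketch below is an illustration of that idea only; the slides do not say which algorithm the retrieval products use, and the function names and the threshold of 2 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_find(query, ocr_text, max_distance=2):
    """Return (position, token) pairs within max_distance edits of the query."""
    hits = []
    for pos, token in enumerate(ocr_text.lower().split()):
        token = token.strip(".,;:!?\"'()")
        if edit_distance(query.lower(), token) <= max_distance:
            hits.append((pos, token))
    return hits


raw_ocr = "The refugce camp received rnigrants; refugees arrived daily."
print(fuzzy_find("refugee", raw_ocr))
```

Raising `max_distance` finds more damaged words but also more unrelated ones, so the threshold is a trade-off between missed hits and false hits.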

Hybrid solutions: page images with underlying searchable text

• The user is presented with a facsimile image of the original for printing and viewing, and attached to each page of the work is a searchable text file.
• Decisions about the method of production of the underlying text will depend on the condition of the originals and the level of accuracy of retrieval required.
• One document type that responds well to hybrid solutions is newspapers, which are high-volume, low-value (generally), mixed media, and usually large in size.
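The hybrid arrangement can be pictured as a simple data structure: each page pairs a facsimile image with a text layer that is searched but never displayed. The sketch below is only a conceptual model; the class, field names, and file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    image_file: str  # facsimile shown to the user (hypothetical file name)
    ocr_text: str    # uncorrected OCR text, used only for searching


def search_pages(pages, term):
    """Return the image files of pages whose hidden text contains the term."""
    needle = term.lower()
    return [p.image_file for p in pages if needle in p.ocr_text.lower()]


pages = [
    Page("issue_001_p1.tif", "Refugees cross the border after fiood"),
    Page("issue_001_p2.tif", "Market prices and shipping news"),
]

print(search_pages(pages, "border"))  # the user is shown the page image
print(search_pages(pages, "flood"))   # missed: the OCR layer reads "fiood"
```

The second query shows the cost of leaving the text layer uncorrected: exact matching misses damaged words, which is where fuzzy matching or selective correction comes in.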

• The Forced Migration Online (FMO) project based at the Refugee Studies Centre, University of Oxford, is taking a different approach. FMO is a portal to a whole range of materials and organizations concerned with the study of the phenomenon of forced migration worldwide, with content contributed by an international group of partners.

• One key component of FMO is a digital library of gray literature and of journals in the field. The digital library is produced by attaching text files of uncorrected OCR to page images: the OCR text is used for searching and is hidden from the user; the page images are for viewing and printing. What is important to users of FMO is documents or parts of documents dealing with key topics, rather than that they can retrieve individual instances of words or phrases. This type of solution can deliver very large volumes of material at significantly lower cost than rekeying, but the trade-offs in some loss of accuracy have to be understood and accepted. Some of the OCR inaccuracies can be mitigated by using fuzzy search algorithms, but this can give rise to problems of over-retrieval. FMO (<www.forcedmigration.org>) uses Olive Software's Active Paper Archive, a product which offers automatic zoning and characterization of complex documents, as well as OCR and complex search and retrieval using fuzzy matching. See Deegan (2002) and <http://www.oclc.org/digitalpreservation/digitizing/newspaper/>.

http://www.jpress.nli.org.il/Olive/APA/NLI/?action=search&text=%D7%97%D7%99%D7%A4%D7%95%D7%A9%D7%99%D7%95%D7%AA%20%D7%94%D7%A7%D7%A6%D7%91

http://www.olivesoftware.com/products/activepaper-archive-2/

JPress, historical press: 1,300,000 pages

Images

• Digital images as primary source materials rather than as secondary surrogates:
  • increasingly photographers are turning from film to digital, and
  • artists are creating digital art works from scratch.
• Many digital images needed by humanists are taken from items outside their control: objects that are held in cultural institutions.
• These institutions have their own facilities for creating images which scholars and students will need to use.
• If they don't have such facilities, analogue surrogates (usually photographic) can be ordered and digitization done from the surrogate.
• The costs charged by institutions vary a great deal.

Technical issues in image capture

• Images need a high level of fidelity to the original.
• They should be at an appropriate resolution, relative to the format and size of the original, and at an appropriate bit depth.
• "High" resolution is based upon factors such as original media size, the nature of the information, and the eventual use.
• Therefore 600 dpi would be considered high-resolution for a photographic print, but low-resolution for a 35 mm slide.
• It must be remembered that resolution is always a factor of two things: (1) the size of the original and (2) the number of dots or pixels.
  - Resolution calculated for a flatbed scanner, which has a fixed relationship with originals, is expressed in dpi.
  - With digital cameras, which have a variable dpi in relation to the originals, given that they can be moved closer or further away, resolution is expressed in absolute terms, either by their x and y dimensions (12,000 × 12,000, say, for the highest-quality professional digital cameras) or by the total number of pixels (4 million, for instance, for a good-quality, compact camera).
  - The digital image itself is best expressed in absolute terms: if expressed in dpi, the size of the original always needs to be known to be meaningful.
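The relationship between original size, dpi, and pixel count can be worked through in a few lines. This is a sketch under stated assumptions: the 10 × 8 inch print, the 36 × 24 mm slide frame, and the 4,000-pixel camera image are example values chosen for illustration, not figures from the slides.

```python
MM_PER_INCH = 25.4


def pixels_for(width_in, height_in, dpi):
    """Pixel dimensions produced by scanning an original of the given size."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixels_across, original_width_in):
    """Resolution, relative to the original, of a fixed-size camera image."""
    return pixels_across / original_width_in


# The same 600 dpi yields very different amounts of detail:
print(pixels_for(10, 8, 600))                               # 10 x 8 inch print
print(pixels_for(36 / MM_PER_INCH, 24 / MM_PER_INCH, 600))  # 35 mm slide frame

# A camera image 4,000 pixels wide has no fixed dpi; it depends on the original:
print(effective_dpi(4000, 10), effective_dpi(4000, 20))
```

The print yields thousands of pixels in each direction while the slide yields only a few hundred, which is why 600 dpi counts as high for one and low for the other.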

• Hardware for digital capture:
  • Flatbed scanners, which are used for reflective and transmissive materials.
  • These can currently deliver up to 5,000 dpi, but can cost tens of thousands of dollars, so most projects can realistically only afford scanners in the high-end range of 2,400 to 3,000 dpi.
  • Bespoke 35 mm film scanners, which are used for transmissive materials such as slides and film negatives, can deliver up to 4,000 dpi.
  • Drum scanners may also be considered as they can deliver much higher relative resolutions and quality, but they are generally not used in this context as the process is destructive to the original photographic transparency and the unit cost of creation is higher.
• Digital cameras can be used for any kind of material, but are generally recommended for those materials not suitable for scanning with flatbed or film scanners: tightly bound books or manuscripts, art images, three-dimensional objects such as sculpture or architecture.
• Digital cameras are becoming popular as replacements for conventional film cameras in the domestic and professional markets, and so there is now a huge choice. High-end cameras for purchase by image studios cost tens of thousands of dollars, but such have been the recent advances in the technologies that superb results can be gained from cameras costing much less than this: when capturing images from smaller originals, even some of the compact cameras can deliver archive-quality scans. However, they need to be set up professionally, and professional stands and lighting must be used.


• For color scanning, the current recommendation for bit depth is that high-quality originals be captured at 24 bit, which renders more than 16 million colors – more than the human eye can distinguish, and enough to give photorealistic output when printed. For black and white materials with tone, 8 bits per pixel is recommended, which gives 256 levels of gray, enough to give photorealistic printed output.
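The color and gray-level counts follow directly from the bit depth: each extra bit doubles the number of values a pixel can take. A minimal check (the function name is illustrative):

```python
def levels(bits_per_pixel):
    """Number of distinct values a pixel can take at a given bit depth."""
    return 2 ** bits_per_pixel


print(levels(24))  # 24-bit color: 16777216 colors
print(levels(8))   # 8-bit grayscale: 256 levels of gray
print(levels(1))   # 1-bit: pure black and white
```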

• Humanists will need for most purposes the highest possible quality for two reasons:
  • they need fine levels of detail in the images
  • the images will in many cases have been taken from rare or unique originals, which might also be very fragile.
• Digital capture, wherever possible, should be done once only, and a digital surrogate captured that will satisfy all anticipated present and future uses.
• This surrogate is known as the "digital master" and should be kept under preservation conditions.
• Any manipulations or post-processing should be carried out on copies of this master image.
• The digital master will probably be a very large file: the highest-quality digital cameras (12,000 × 12,000 pixels) produce files of up to 350 MB, which means that it is not possible to store more than one on a regular CD-ROM disk. A 35 mm color transparency captured at 2,700 dpi (the norm for most slide scanners) in 24-bit color would give a file size of 25 MB, which means that around 22 images could be stored on one CD-ROM.
• The file format generally used for digital masters is the TIFF (Tagged Image File Format), a de facto standard for digital imaging. There are many other file formats available, but TIFF can be recommended as the safest choice for the long term. The "Tagged" in the title means that various types of information can be stored in a file header of the TIFF files.
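The file sizes quoted above can be approximated from pixel count and bit depth. The sketch below assumes uncompressed storage, a 36 × 24 mm frame for the 35 mm transparency, and a 650 MB CD-ROM; none of these assumptions is stated on the slide, so the results are rough estimates that land near, but not exactly on, the quoted figures.

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image size in bytes, ignoring headers and compression."""
    return width_px * height_px * bits_per_pixel // 8


def images_per_disc(image_bytes, disc_bytes=650_000_000):
    """How many whole images fit on a disc (650 MB CD-ROM assumed)."""
    return disc_bytes // image_bytes


# Highest-quality camera: 12,000 x 12,000 pixels in 24-bit color.
camera = uncompressed_bytes(12_000, 12_000, 24)

# 35 mm transparency at 2,700 dpi; the 36 x 24 mm frame size is an assumption.
slide_w = round(36 / 25.4 * 2700)
slide_h = round(24 / 25.4 * 2700)
slide = uncompressed_bytes(slide_w, slide_h, 24)

print(camera / 1e6, "MB per camera image;", images_per_disc(camera), "per disc")
print(round(slide / 1e6, 1), "MB per slide scan;", images_per_disc(slide), "per disc")
```

Under these assumptions only one camera image fits on a disc, and the slide scans come out at roughly 22 per disc, in line with the estimate in the text.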

Compression and derivatives

• It is possible to reduce file sizes of digital images using compression techniques, though this is often not recommended for the digital masters.
• Compression comes in two forms: "lossless", meaning that there is no loss of data through the process, and "lossy", meaning that data is lost, and can never be recovered.
• There are two lossless compression techniques that are often used for TIFF master files, and which can be recommended here: the LZW compression algorithm for materials with color content, and the CCITT Group 4 format for materials with 1-bit, black and white content.
• Derivative images from digital masters are usually created using lossy compression methods, which can give much greater reduction in file sizes than lossless compression for color and greyscale images.
• Lossy compression is acceptable for many uses of the images, especially for Web purposes.
• However, excessive compression can cause problems in the viewable images, creating artifacts such as pixelation, dotted or stepped lines, regularly repeated patterns, moire, halos, etc.
• For the scholar seeking the highest level of fidelity to the originals, this is likely to be unacceptable, and so experimentation will be needed to give the best compromise between file size and visual quality.
• The main format for derivative images for delivery to the web or on CD-ROM is currently JPEG.
• This is a lossy compression format that can offer considerable reduction in file sizes if the highest levels of compression are used, but this comes at the cost of some compromise of quality. However, it can give color thumbnail images of only around 7 KB and screen resolution images of around 60 KB, a considerable benefit if bandwidth is an issue.
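The difference between lossless and lossy compression can be demonstrated on a synthetic row of grayscale pixels. Note the substitutions: this sketch uses Python's standard `zlib` (DEFLATE) as a stand-in for a lossless scheme, not LZW or CCITT Group 4, and it imitates lossy compression by crudely discarding the low-order bits of each pixel, which is far simpler than what JPEG does.

```python
import zlib

# Synthetic 8-bit grayscale data: a gradient with a little fine texture.
pixels = bytes((i // 4 + (i * 7) % 3) % 256 for i in range(4096))

# Lossless: decompressing returns exactly the original bytes.
packed = zlib.compress(pixels, 9)
restored = zlib.decompress(packed)

# Lossy (crude illustration): drop the 4 low-order bits, then compress.
quantized = bytes(p & 0xF0 for p in pixels)
packed_lossy = zlib.compress(quantized, 9)

print("original bytes:", len(pixels))
print("lossless bytes:", len(packed), "identical after round trip:", restored == pixels)
print("lossy bytes:   ", len(packed_lossy), "identical to original:", quantized == pixels)
```

The lossy version is smaller, but the discarded fine texture cannot be recovered from it, which is why masters are kept lossless and only derivatives are compressed this way.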

Audio and video capture

• Time-based media.
• Media studies is an important and growing area, and historians of the modern period too derive great benefit from having digital access to time-based primary sources such as news reports, film, etc.
• Literary scholars also benefit greatly from access to plays, and to filmed versions of literary works.

• Editing of Captured Content
• The editing of the content captured is done via software tools known as non-linear editing suites. These allow the content to be manipulated, edited, spliced, and otherwise changed to facilitate the production of suitable content for the prospective end user. The ability to do this in real time is essential to the speed and accuracy of the eventual output. Also, the editing suite should have suitable compressors for output to Web formats and Internet streaming.

Metadata

• Data must be documented properly so that curators and users of the future understand what it is that they are dealing with.
• Metadata is one of the critical components of digital resource conversion and use, and is needed at all stages in the creation and management of the resource.
• Any creator of digital objects should take as much care in the creation of the metadata as they do in the creation of the data itself: time and effort expended at the creation stage recording good-quality metadata is likely to save users much grief, and to result in a well-formed digital object which will survive for the long term.
• Documentation of data must be done right from the start.
• Technical or administrative metadata: having archive-quality digital master files is useless if the filenames mean nothing to anyone but the creator, and there is no indication of date of creation, file format, type of compression, etc.
• Descriptive metadata:
  • describes the attributes of the object and can be extensive
  • attributes such as: "title", "creator", "subject", "date", "keywords", "abstract", etc.
  • many of the things that would be catalogued in a traditional cataloguing system.
  • It may be possible to request or supply project-specific metadata.
  • Descriptive metadata can only be added by experts who understand the nature of the source materials, and it is an intellectually challenging task in itself to produce good descriptive metadata.
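A metadata record combining the two kinds described above might be modeled as follows. This is a sketch rather than a standard schema: the field names follow the attributes listed on the slide, while the class names and every value in the sample record are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TechnicalMetadata:
    filename: str
    date_created: str
    file_format: str
    compression: str
    resolution_dpi: int
    bit_depth: int


@dataclass
class DescriptiveMetadata:
    title: str
    creator: str
    subject: str
    date: str
    keywords: list = field(default_factory=list)
    abstract: str = ""


@dataclass
class DigitalObject:
    technical: TechnicalMetadata
    descriptive: DescriptiveMetadata


# All values below are invented for illustration.
record = DigitalObject(
    technical=TechnicalMetadata(
        filename="ms0042_f001r_master.tif",
        date_created="2017-05-01",
        file_format="TIFF",
        compression="none",
        resolution_dpi=600,
        bit_depth=24,
    ),
    descriptive=DescriptiveMetadata(
        title="Folio 1 recto",
        creator="Unknown scribe",
        subject="Polyphonic music fragment",
        date="15th century",
        keywords=["manuscript", "music"],
    ),
)

print(json.dumps(asdict(record), indent=2))
```

Storing the record next to the master file in a plain, self-describing form such as JSON means a future curator does not have to decode the filename to learn how the image was made.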

Computational archaeology

• Coins
  • http://nomisma.org/sparql
• Pottery sherds
• Paperless excavations

• http://www.deadseascrolls.org.il/explore-the-archive
• http://www.deadseascrolls.org.il/featured-scrolls
• http://www.deadseascrolls.org.il/home

The Dead Sea Scrolls

• Mainly from the first and second centuries BCE
• Made of parchment or papyrus
• Major sources for ancient Hebrew; also in Aramaic and Greek
• "The Judean Desert sect", Second Temple period: probably the Essenes mentioned by Josephus (Yosef ben Matityahu), or the Sadducean priests
• Content: the Bible, apocryphal books, sectarian books, polemical literature
• 1940s: discovered by shepherds
• Eliezer Sukenik (father of Yigael Yadin) acquired the scrolls (The War of the Sons of Light Against the Sons of Darkness)
• Excavations

The National Library

• 1892: the B'nai B'rith organization
• The Hebrew University (founded 1925)
• The National Library Law, 2007: a public-benefit company
• Before the law, a regulation (dating back to the Mandate period):
• "Anyone who publishes a book, periodical, research work, record, film, or any other publication in fifty copies or more is required to send two copies to the National Library."

OCR – Optical Character Recognition

• OCR, or Optical Character Recognition, is the conversion of an image of typed text into a searchable document.

• https://ilanarmiller.wordpress.com/2016/11/01/a-quick-guide-why-use-ocr-as-an-historian/

SOFTWARE

• Adobe Acrobat Pro (Windows / Mac, closed source, commercial): Acrobat Pro is available to all UC Berkeley affiliates via campus license. Claim your Adobe Creative Cloud license to install Acrobat on your machine or utilize ETS computer facilities.
• ABBYY FineReader (Windows / Mac, closed source, commercial with educational discount and free trial): ABBYY FineReader is a robust tool for OCR. It works well with digital camera images and unusually structured text (e.g. magazine layouts, newspaper columns), offers automated workflows for conversion, and supports up to 190 languages.
• Tesseract (Windows / Mac / Linux, open source, free): Tesseract is an open source OCR engine. It can be used directly (via the command line) or with an API. Several third-party graphical user interfaces (GUI) are available for users who would like a drag-and-drop interface. Specialized packages for working with different languages and scripts, such as cuneiform and Vietnamese, are also available. Read Ammon Shepherd's "Watermarking and OCR-ing Your Images" blog for a short walkthrough of using Tesseract without a GUI. Shepherd also provides scripts for batch processing.
• Google Docs (Web, free): Google Docs allows users to perform OCR on uploaded images and PDFs. See this blog for a walkthrough and screenshots. Read about recommended document specifications here.

http://digitalhumanities.berkeley.edu/resources/digitization-workflows-scanning-ocr-and-audio-transcription
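For the command-line route, a Tesseract call takes an input image, an output base name, and optional settings such as the language. The sketch below only builds and prints the command, it does not run it; the file names are hypothetical, and it assumes Tesseract and the relevant language data (here `heb` for Hebrew) are installed.

```python
import shlex


def tesseract_command(image, output_base, lang="eng", output_format=None):
    """Build (but do not run) a Tesseract command line.

    output_format may be None (plain text) or a Tesseract output config
    such as "pdf" or "hocr".
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if output_format:
        cmd.append(output_format)
    return cmd


# Hypothetical scan of a Hebrew page, written out as a searchable PDF.
cmd = tesseract_command("page_001.tif", "page_001", lang="heb", output_format="pdf")
print(" ".join(shlex.quote(part) for part in cmd))
```

The `pdf` output places the recognized text behind the page image, which is the same hybrid arrangement of facsimile plus searchable text discussed earlier; passing the list to `subprocess.run` would execute it.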