arcomem training specifying-crawls


Upload: arcomem

Post on 06-May-2015


DESCRIPTION

This presentation on Specifying Crawls is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.

TRANSCRIPT

Page 1: Arcomem training specifying-crawls

Specifying crawls

France Lasfargues Internet Memory Foundation

Paris, France


Goal

➔ Help the user specify the campaign properly

➔ Help the user understand what is going on in the back end of the ARCOMEM platform

➔ Set up a campaign in the crawler cockpit


Plan

➔ What is the Web? Challenges and SOA

➔ ARCOMEM platform

➔ Crawler

➔ Set up a campaign in the ARCOMEM Crawler Cockpit


Introduction: How does the web work?

➔ The web is managed by protocols and standards:

• HTTP Hypertext Transfer Protocol

• HTML HyperText Markup Language

• URL Uniform Resource Locator

• DNS Domain Name System

➔ Each server has an address: its IP address

• Example: http://213.251.150.222/ -> http://collections.europarchive.org


WWW

The web is a large space of communication and information:

• managed by servers which talk to each other by convention (protocol) and through applications in a large network

• an organized and controlled naming space (ICANN)

World Wide Web: abbreviated as WWW and commonly known as the Web, it is a system of interlinked hypertext documents accessed via the Internet


HTTP - Hypertext Transfer Protocol

➔ Client/server notion

– Request-response protocol in the client-server computing model

➔ How does it work?

– The client asks for content

– The server hosts the content and delivers it

– The browser resolves the host name through a DNS server, connects to the web server and sends it a request.
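The request-response exchange above can be sketched with Python's standard library. This is only an illustration: the handler class, the `fetch_once` helper and the page body are made up for the example, and the server runs on the local machine so that no DNS lookup is involved.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: hosts the content and delivers it on request."""

    def do_GET(self):
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start a throwaway local server, send one GET, return (status, content type, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: the OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: connect to the server and ask for a resource.
        connection = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        connection.request("GET", "/index.html")
        response = connection.getresponse()
        result = (response.status, response.getheader("Content-Type"), response.read())
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

A crawler plays the client role in exactly this exchange, only at a much larger scale and for many servers at once.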


HTML - HyperText Markup Language

➔ Markup language for web pages

➔ Written in the form of HTML elements

➔ Creates structured documents by denoting structural semantic elements for text such as headings, paragraphs, titles, links, quotes, and other items

➔ Allows text and embedded objects such as images

➔Example : http://www.w3.org/


URI - URL

➔ URL (Uniform Resource Locator): specifies where an identified resource is available and the mechanism for retrieving it.

➔ Examples :

– http://host.domain.extension/path/pageORfile

– http://www.europarchive.org

– http://collections.europarchive.org/

– http://www.europarchive.org/about.php
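As a small illustration of the `http://host.domain.extension/path/pageORfile` pattern, the sketch below splits a URL into its parts with Python's `urllib.parse`. The `describe_url` helper is a name invented for this example.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Break a URL into the retrieval mechanism (scheme), the host and the path."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # mechanism for retrieving the resource
        "host": parts.hostname,      # where the resource is available
        "path": parts.path or "/",   # an empty path means the site root
    }
```

For `http://www.europarchive.org/about.php` the host is `www.europarchive.org` and the path is `/about.php`.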


Samos 2013 – Workshop : The ARCOMEM Platform


Domain name and extension

➔ Managed by ICANN (Internet Corporation for Assigned Names and Numbers), a non-profit organization; domain names are allocated by registrars.

• http://www.icann.org

➔ ICANN coordinates the allocation and assignment to ensure the universal resolvability of :

• Domain names (forming a system referred to as «DNS»)

• Internet protocol («IP») addresses

• Protocol port and parameter numbers.

➔ Several types of TLD

• Top-level domains (TLD): .com, .info, etc.

• gTLD: .aero, .biz, .coop, .info, .museum, .name, and .pro

• ccTLD (country code top-level domains): .fr


What kind of contents?

➔ Different types of content: multimedia (text, video, images)

➔ Different types of producers:

• public : institution, government, museum, TV....

• private : foundation, company, press, people, blog...

http://ec.europa.eu/index_fr.htm

http://iawebarchiving.wordpress.com/

http://www.nytimes.com/

➔ Each producer is in charge of its content

• Information can disappear: fragility

• Size


Social web

➔ Focus on people’s socialization and interaction

• Characteristics:

– Walled space in which users can interact

– Creation of social networks

• Examples:

– Sharing bookmarks (Del.icio.us, Digg), videos (Dailymotion, YouTube), photos (Flickr, Picasa)

– Communities (MySpace, Facebook)

➔ Web archive: challenges in terms of content, privacy and technique.


Ex. of technical difficulties: Videos

➔ Standard HTTP protocol

• obfuscated links to the video files

• dynamic playlists and channels or configuration files loaded by the player

• several hops and redirects to the server of the video content

e.g.: YouTube

➔ Streaming protocols: RTSP, RTMP, MMS...

• real-time protocols implemented by the video players, suited for large video files (control commands) or live broadcasts

• sometimes proprietary protocols (e.g.: RTMP - Adobe)

Available tools: MPlayer, FLVStreamer, VLC


Deep /Hidden Web

• Deep web: content that is only accessible behind a password, a database query, a payment... and is hidden from search engines


Source: http://c.asselin.free.fr/french/schema_webinvisible.htm Diagram based on the figure "Distribution of Deep Web sites by content type" from the Bright Planet study.


How do we archive it ?

➔ Challenges for archiving:

– dynamic websites

➔ Technical barriers:

• some JavaScript

• Flash animation

• pop-ups

• streaming video and audio

• restricted access

➔ Traps: spam and loops


What does a user need to do web archiving?

➔ Define the target content (website, URL, topic…)

➔ A tool to manage the campaign

➔ An intelligent crawler to archive content


Management tools (1)

➔ NetarchiveSuite (http://netarchive.dk/suite/)

➔ Web Curator Tool: http://webcurator.sourceforge.net

– Open-source workflow management application for selective web archiving, developed by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium

➔ Archive-It: http://www.archive-it.org/

• A subscription service from the Internet Archive to build and preserve collections: it allows users to harvest, catalog, manage and browse archived collections

➔ Archivethe.net: http://archivethe.net/fr/

• Service provided by the Internet Memory Foundation.

➔ Arcomem crawler cockpit


How does a crawler work ?

• A crawler is a bot that parses web pages in order to index and/or archive them. The robot navigates by following links.

➔ Links are at the center of the crawling problem

• Explicit links: source code is available and the full path is explicitly stated

• Variable links: source code is available but uses variables to encode the path

• Opaque links: source code is not available

Example : http://www.thetimes.co.uk/tto/news/
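To make the "explicit link" case concrete, here is a minimal sketch of how a crawler could collect the `href` targets of a page with Python's standard `html.parser`. The `LinkExtractor` class and `extract_links` helper are names chosen for this example; they are not part of any ARCOMEM component. Variable and opaque links would not be found this way, since they only exist once scripts are executed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> elements as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative paths against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Each extracted URL is a candidate for the crawler's queue; whether it is actually fetched depends on the crawl parameters.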


Parameters

➔ Scoping function: defines how deep the crawl will go

• Complete or specific content of a website

• Discovery or focused crawl

➔ Politeness

• Follow the common rules of politeness

➔ Robots.txt

• Follow the directives in the site's robots.txt file

➔ Frequency

• How often do I want to launch a crawl on this target?
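A minimal sketch of the robots.txt and politeness checks, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent and the `build_policy` helper are invented for the illustration; a real crawler would download the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt):
    """Parse robots.txt text and return an object that answers 'may I fetch this?'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# Robots.txt: is this URL allowed for our (hypothetical) user agent?
allowed = policy.can_fetch("example-bot", "http://www.example.com/index.html")
blocked = policy.can_fetch("example-bot", "http://www.example.com/private/page.html")

# Politeness: number of seconds to wait between two requests to the same site.
delay_seconds = policy.crawl_delay("example-bot")
```

Here the home page may be fetched, the `/private/` directory may not, and the site asks for a pause of 2 seconds between requests.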


Source code: http://www.arcomem.eu/

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="de-DE">

<head profile=http://gmpg.org/xfn/11>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

<meta name="distribution" content="global" />

<meta name="robots" content="follow, all" />

<meta name="language" content="en" />

<meta name="bitly-verification" content="59eb4f9028ea"/>

<meta name="verify-v1" content="7XvBEj6Tw9dyXjHST/9sgRGxGymxFdHIZsM6Ob/xo5E=" />

<title> ARCOMEM</title>

<div id="navbar">

<div class="menu"><ul class="menu"><li class="page_item page-item-1490"><a href="http://www.arcomem.eu/ipres-2013/" title="iPres 2013">iPres 2013</a></li><li class="page_item page-item-1478"><a href="http://www.arcomem.eu/system-demos/" title="SYSTEM DEMOS">SYSTEM DEMOS</a><ul class='children'><li class="page_item page-item-1502"><a href="http://www.arcomem.eu/system-demos/technology-demos/" title="Technology Demos">Technology Demos</a></li></ul></li><li class="page_item page-item-2"><a href="http://www.arcomem.eu/about/" title="ABOUT ARCOMEM">ABOUT ARCOMEM</a><ul class='children'><li class="page_item page-item-14"><a href="http://www.arcomem.eu/about/use-cases/" title="USE CASES">USE CASES</a></li><li class="page_item page-item-16"><a href="http://www.arcomem.eu/about/research/" title="R&amp;D CHALLENGES">R&#038;D CHALLENGES</a></li></ul></li><li class="page_item page-item-20"><a href="http://www.arcomem.eu/downloads/" title="DOWNLOADS">DOWNLOADS</a><ul class='children'><li class="page_item page-item-1043"><a href="http://www.arcomem.eu/downloads/code/" title="CODE">CODE</a></li><li class="page_item page-item-973"><a href="http://www.arcomem.eu/downloads/deliverables/" title="DELIVERABLES">DELIVERABLES</a></li></ul></li><li class="page_item page-item-798"><a href="http://www.arcomem.eu/videos/" title="VIDEOS">VIDEOS</a></li><li class="page_item page-item-761"><a href="http://www.arcomem.eu/dissemination-activities/" title="DISSEMINATION ACTIVITIES">DISSEMINATION ACTIVITIES</a><ul class='children'><li class="page_item page-item-1235"><a href="http://www.arcomem.eu/dissemination-activities/past-dissemination-activities/" title="PAST ACTIVITES">PAST ACTIVITES</a></li><li class="page_item page-item-912"><a href="http://www.arcomem.eu/dissemination-activities/publications/" title="PUBLICATIONS">PUBLICATIONS</a></li><li class="page_item page-item-888"><a href="http://www.arcomem.eu/dissemination-activities/icwsm-2012-workshop/" title="ICWSM 2012">ICWSM 2012</a></li><li 
class="page_item page-item-1004"><a href="http://www.arcomem.eu/dissemination-activities/kecsm2012/" title="KECSM 2012">KECSM 2012</a></li></ul></li><li class="page_item page-item-1157"><a href="http://www.arcomem.eu/related-projects-2/" title="RELATED PROJECTS">RELATED PROJECTS</a></li><li class="page_item page-item-282"><a href="http://www.arcomem.eu/contact/" title="CONTACT">CONTACT</a></li></ul></div>


ARCOMEM Workflow


Memory Bot

• Component Name: IMF Large Scale Crawler

– The large scale crawler retrieves content from the web and stores it in an HBase repository. It aims at being scalable: crawling at a fast rate from the start and slowing down as little as possible as the amount of visited URLs grows to hundreds of millions, all while observing politeness conventions (rate regulation, robots.txt compliance, etc.).

• Input:

– URLs with a score (seeds, then URLs output by the analysis process)

• Output:

– Web resources written to WARC files. We have also developed an importer to load these WARC files into HBase. Some metadata is also extracted: HTTP status code, identified outlinks, MIME type, etc.
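To give an idea of what "written to WARC files" means, the sketch below builds one simplified WARC 1.0 `response` record by hand. It is not the crawler's own writer: the `warc_response_record` function is a name chosen for this example, only the mandatory header fields plus the target URI are set, and `when` is assumed to be a UTC timestamp.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_payload, when=None):
    """Wrap a raw HTTP response (status line, headers, body) in a WARC 1.0 record."""
    when = when or datetime.now(timezone.utc)  # assumed to be expressed in UTC
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header lines end with CRLF, a blank line separates header and block,
    # and two CRLFs close the record.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + http_payload + b"\r\n\r\n"
```

A WARC file is a sequence of such records, so metadata such as the HTTP status code can later be read back from the stored payload.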


WARC


Memory Bot Trap rules

➔ Number of path segments (for the URL http://www.example.com/a/b/c/ there are 3 path segments: a, b and c); default max is 5

➔ Parameter=value repetitions in the query (for the URL http://www.example.com?a=1&a=1&a=2 there are 2 repetitions); default max is 5

➔ Filter out URLs with parameters whose names start with "b_start" and are longer than 20 chars

➔ Calendar and forum regular expressions

➔ Maximum number of consecutive repetitions of the longest path segment (for the path "/a/b/c/a/b/c/d/a/b/c" the longest path segment is /a/b/c and it appears 2 times consecutively); default max is 3

➔ Note: we truncate all URLs to 256 chars
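Two of these rules are easy to sketch in Python. The function names are invented for the example, and one point is an assumption: a URL is treated as a trap when a count is strictly greater than the default maximum. The calendar, forum, "b_start" and consecutive-repetition rules are left out.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

MAX_PATH_SEGMENTS = 5       # default max from the slide
MAX_PARAM_REPETITIONS = 5   # default max from the slide
MAX_URL_LENGTH = 256        # all URLs are truncated to this length


def path_segment_count(url):
    """Number of non-empty path segments: /a/b/c/ has 3."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def max_param_repetitions(url):
    """Highest number of times the same parameter=value pair occurs in the query."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return max(Counter(pairs).values(), default=0)


def looks_like_trap(url):
    """Assumption: 'exceeds the default max' means strictly greater than it."""
    url = url[:MAX_URL_LENGTH]
    return (
        path_segment_count(url) > MAX_PATH_SEGMENTS
        or max_param_repetitions(url) > MAX_PARAM_REPETITIONS
    )
```

With the slide's own examples, `http://www.example.com/a/b/c/` has 3 segments and `http://www.example.com?a=1&a=1&a=2` has 2 repetitions, so neither is filtered out.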


Adaptive Heritrix

➔ Component Name: Adaptive Heritrix

➔ Description: Adaptive Heritrix is a modified version of the open source crawler Heritrix that allows the dynamic reordering of queued URLs and receiving URLs from the Online Analysis module.


How does Adaptive Heritrix work?

➔ The prioritisation module communicates new scores to the crawler queue using JSON over HTTP.

➔ The prioritisation module sends a POST to http://QUEUE_SERVER/update. The request body is a JSON-encoded array of update objects, for example:

[{"url": "http://google.com/", "score": 0.3, "parentUrl": "http://seed.tld/page"},

{"url": "http://spam.net/", "blacklisted": true, "parentUrl": "http://seed.tld/page"}]
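The exchange can be sketched in Python as follows. Only the update-object fields (`url`, `score`, `blacklisted`, `parentUrl`) come from the slide; the host name, the two function names and the in-memory queue (a plain dictionary of URL to score) are assumptions made for the illustration, and the request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical stand-in for http://QUEUE_SERVER/update
QUEUE_UPDATE_URL = "http://queue-server.example/update"


def build_update_request(updates):
    """Sender side: encode the update objects as a JSON array in a POST request."""
    body = json.dumps(updates).encode("utf-8")
    return Request(
        QUEUE_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def apply_updates(frontier, blacklist, body):
    """Receiver side (sketch): re-score or blacklist queued URLs, return the new order.

    frontier maps URL -> score; blacklist is a set of URLs that must not be crawled.
    """
    for update in json.loads(body):
        url = update["url"]
        if update.get("blacklisted"):
            blacklist.add(url)
            frontier.pop(url, None)
        elif "score" in update:
            frontier[url] = update["score"]
    # Highest score first: this is the dynamic reordering of the queue.
    return sorted(frontier, key=frontier.get, reverse=True)
```

Applied to the two example objects, `http://google.com/` enters the queue with a score of 0.3 and `http://spam.net/` is removed from it.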


API Crawler

➔ Component Name: API Crawler

➔ Description:

• The API Crawler is a solution to manage keyword-based crawls of different social platforms using their Web APIs. It is controlled via a RESTful Web interface.

• Scalability and performance: 3000 requests per hour, millions of triples per hour, millions of links per hour

➔ Input: List of tuples (keyword, platform)

➔ Output: Triples stored in the triple store and WARC files stored in HDFS

➔ Twitter restriction: 180 requests per 15 minutes, where one request covers one criterion. Each request returns 100 answers.
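The Twitter restriction translates into a simple budget. The short calculation below uses only the figures from the slide; the function names are invented, and sharing the 180 requests evenly between criteria is an assumption about how a campaign might use them.

```python
REQUESTS_PER_WINDOW = 180   # requests allowed per window
WINDOW_MINUTES = 15         # length of one window
ANSWERS_PER_REQUEST = 100   # answers returned by one request


def max_answers_per_hour():
    """Upper bound on answers collected in one hour if every request is used."""
    windows_per_hour = 60 // WINDOW_MINUTES
    return REQUESTS_PER_WINDOW * ANSWERS_PER_REQUEST * windows_per_hour


def requests_per_criterion(criteria_count):
    """Requests available to each criterion per window, assuming an even split."""
    return REQUESTS_PER_WINDOW // criteria_count
```

In other words, 180 × 100 × 4 = 72,000 answers per hour at most, and a campaign with 12 keywords can query each of them 15 times per 15-minute window.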


How does the API crawler work?

➔ Principles: a crawler runs crawls. Each crawl has a crawl ID assigned by the pipeline. The pipeline ensures crawl IDs are unique. A crawl has four states: running, stopped, being deleted, deleted. A crawl runs until it ends by itself or until a stop order is received. Only a stopped crawl can be deleted.

➔ The API Crawler produces three kinds of data:

– semi-structured data stored as triples in the triple store,

– outlinks sent to Heritrix or the IMF crawler,

– and WARC files saved in the file system, that will also possibly be inserted into HBase.
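The crawl life cycle described in the principles can be sketched as a small state machine. The class and method names are invented for this example and do not reflect the API Crawler's actual interface; whether a stopped crawl can be restarted is not stated on the slide, so that transition is not modelled.

```python
from enum import Enum


class CrawlState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    BEING_DELETED = "being deleted"
    DELETED = "deleted"


# Allowed transitions: only a stopped crawl can be deleted.
ALLOWED = {
    CrawlState.RUNNING: {CrawlState.STOPPED},
    CrawlState.STOPPED: {CrawlState.BEING_DELETED},
    CrawlState.BEING_DELETED: {CrawlState.DELETED},
    CrawlState.DELETED: set(),
}


class Crawl:
    def __init__(self, crawl_id):
        self.crawl_id = crawl_id          # assigned by the pipeline, assumed unique
        self.state = CrawlState.RUNNING

    def _move(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state

    def stop(self):
        """The crawl ended by itself or a stop order was received."""
        self._move(CrawlState.STOPPED)

    def delete(self):
        """Start the deletion; refused unless the crawl is stopped."""
        self._move(CrawlState.BEING_DELETED)

    def finish_delete(self):
        self._move(CrawlState.DELETED)
```

Calling `delete()` on a running crawl raises an error, which mirrors the rule that only a stopped crawl can be deleted.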


Output: triples


ICS: Intelligent crawl specifications


Application Aware helper

➔ Component Name: Application-aware helper

– The goal of this software component is to make the crawler aware of the particular kind of Web application being crawled, in terms of general classification of websites (wiki, social network, blog, web forum, etc.), technical implementation (Mediawiki, Wordpress, etc.), and their specific instances (Twitter, CNN, etc.).

➔ Input:

– HTML content as string, base URL, list of out-links

➔ Output:

– Augmented document (original text document and structured objects extracted from web page) and extracted links with score will be sent to ARCOMEM framework module. Extracted semantic objects, crawling actions, and out-links with score will also be stored in the ARCOMEM database.


ARCOMEM Crawler


How does the AAH work?

➔ The application-aware helper will be assisted by a knowledge base that will help it recognize a specific web application and the related crawling actions

➔ Since the knowledge base will grow and several detection patterns will exist for many web applications, we have to ensure that the web application detection module does not slow down the crawling process or affect overall performance.

➔ To ensure scalability after integrating the application-aware helper with the crawler, we used the YFilter system (an NFA-based filtering system) to index detection patterns efficiently and quickly find the relevant web application.

➔ Here each state is represented by XPath expression patterns, and common steps of the path expressions are represented only once in the structure. Introducing YFilter in the web application detection module improves performance, and the system is now well synchronized with the other sub-modules of the crawling process.
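The idea that common steps are stored only once can be illustrated with a prefix tree. This is a deliberately reduced sketch, not YFilter itself: patterns are plain element paths without wildcards or predicates, and the pattern strings and application names in the usage note are made up.

```python
class Node:
    def __init__(self):
        self.children = {}          # step name -> Node
        self.applications = set()   # applications whose pattern ends at this step


class PatternIndex:
    """Prefix tree over path steps: shared leading steps are stored only once."""

    def __init__(self):
        self.root = Node()
        self.node_count = 0

    def add(self, pattern, application):
        node = self.root
        for step in pattern.strip("/").split("/"):
            if step not in node.children:
                node.children[step] = Node()
                self.node_count += 1
            node = node.children[step]
        node.applications.add(application)

    def match(self, element_path):
        """Return the applications whose pattern is a prefix of element_path."""
        found = set()
        node = self.root
        for step in element_path.strip("/").split("/"):
            node = node.children.get(step)
            if node is None:
                break
            found |= node.applications
        return found
```

Indexing `/html/body/div/table` and `/html/body/div/article` creates 5 nodes instead of 8, because the three leading steps are shared; one walk down the tree then tests a document path against all patterns at once.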


Set up a campaign in the Crawler Cockpit (CC)


Scoping function


Domain: entire website, e.g. http://www.site.com

Path: only a specific directory of a website, e.g. http://www.site.com/actu

Sub domain: http://sport.site.com

Page + context: http://www.site.com/home.html
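A scoping function of this kind can be sketched as a test applied to every discovered URL. The `in_scope` function and its mode names are assumptions for the illustration, not the Crawler Cockpit's implementation. Two simplifications are made: "domain" means an exact host match (so a sub-domain needs its own seed), and "page" only checks the page itself, not its context (images, style sheets, etc.).

```python
from urllib.parse import urlsplit


def in_scope(seed, url, mode):
    """Decide whether url belongs to the crawl defined by seed and mode.

    mode is "domain", "path" or "page" (hypothetical names for this sketch).
    """
    if mode not in ("domain", "path", "page"):
        raise ValueError(f"unknown scope mode: {mode}")
    seed_parts, url_parts = urlsplit(seed), urlsplit(url)
    if seed_parts.hostname != url_parts.hostname:
        return False
    if mode == "domain":
        return True
    if mode == "path":
        prefix = seed_parts.path.rstrip("/")
        # /actu matches /actu and /actu/..., but not /actualite
        return url_parts.path == prefix or url_parts.path.startswith(prefix + "/")
    return (url_parts.path or "/") == (seed_parts.path or "/")
```

With the seed `http://www.site.com/actu` in "path" mode, `http://www.site.com/actu/2013/a.html` is kept while `http://www.site.com/sport/` is discarded.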


Target content


Add your target content in this part


Schedule


Frequency: weekly, monthly, quarterly …

Interval: 1 to 9

Calendar: a campaign has a start date and an end date.
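How the three settings combine can be sketched as a list of crawl dates. The slide does not say how frequency and interval interact, so the reading used here is an assumption: an interval of N means "every N weeks / months / quarters". The function names are invented for the example.

```python
import calendar
from datetime import date, timedelta

MONTHS_PER_STEP = {"monthly": 1, "quarterly": 3}


def add_months(day, months):
    """Shift a date by whole months, clamping to the last day of short months."""
    index = day.year * 12 + (day.month - 1) + months
    year, month_index = divmod(index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(day.day, last_day))


def crawl_dates(start, end, frequency="weekly", interval=1):
    """List the launch dates of a campaign between start and end (inclusive)."""
    if not 1 <= interval <= 9:
        raise ValueError("interval must be between 1 and 9")
    dates = []
    step = 0
    while True:
        if frequency == "weekly":
            current = start + timedelta(weeks=interval * step)
        else:
            current = add_months(start, MONTHS_PER_STEP[frequency] * interval * step)
        if current > end:
            return dates
        dates.append(current)
        step += 1
```

For a campaign running from 1 January to 31 January 2013 with a weekly frequency and an interval of 2, the crawl would be launched on 1, 15 and 29 January.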