Configuring Crawling with robots.txt and Meta Tags

NHN NEXT Security Study / 정윤성

Security From Internet Crawling Robot

Image source: http://www.dailygalaxy.com/my_weblog/internet/

First: how a search web crawler works

The web crawler periodically collects results (content, XML, HTML, etc.), stores them internally, and serves them to users quickly.

@Cache

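How a search web crawler works: consulting the robots.txt file

The crawler analyzes access permissions, accessible paths, and so on, and collects only the content it is allowed to collect.

http://www.HomePage.com/robots.txt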
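The filtering step described above can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration, not how any particular search engine is implemented; the bot name `ExampleBot`, the `/private/` rule, and the candidate URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def collectable(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that robots_txt lets user_agent fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and candidate pages (not from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

CANDIDATES = [
    "http://www.HomePage.com/index.html",
    "http://www.HomePage.com/private/report.html",
]

if __name__ == "__main__":
    for url in collectable(ROBOTS_TXT, "ExampleBot", CANDIDATES):
        print("would collect:", url)
```

A real crawler would first download the file from the site, for example with `RobotFileParser.set_url()` followed by `read()`, before asking `can_fetch()`.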

How a search web crawler works: robots.txt

This is called the Robots Exclusion Protocol.

Web site owners use the /robots.txt file to give instructions about their site to web robots;

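Meaning of the Robots Exclusion Standard

Source: http://ko.wikipedia.org/wiki/로봇_배제_표준

The Robots Exclusion Standard is a convention for keeping robots from accessing a website.

It was first created in June 1994 and long had no RFC; it was eventually published as RFC 9309 in 2022.

The convention is advisory: its purpose is for robots to read the robots.txt file and stop accessing the listed content.

Therefore, even with access-prevention rules in place, other people can still access those files.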

How to write robots.txt

See http://www.robotstxt.org/ for details.

1. The robots.txt file must exist at the top-level root of the website.
2. The file name must be written in lowercase with no spaces.
3. User-agent names the bot.
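Because the file always lives at the site root under a fixed lowercase name, its location can be derived from any page URL. A minimal sketch using only the standard library; the function name `robots_txt_url` is an assumption chosen for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.HomePage.com/a/b/page.html?x=1"))
```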



robots.txt example 1

Disallow all pages for all robots:

User-agent: *
Disallow: /

Allow all pages for all robots:

User-agent: *
Disallow:

For all robots, disallow package a, package b, and package c (the /a/, /b/, and /c/ directories):

User-agent: *
Disallow: /a/
Disallow: /b/
Disallow: /c/

Disallow all packages for BadBot:

User-agent: BadBot
Disallow: /

Allow all packages for Google, and disallow all packages for every other bot:

User-agent: Google
Disallow:

User-agent: *
Disallow: /
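One way to sanity-check rule sets like the ones above is to feed them to Python's standard `urllib.robotparser` and ask what a given bot may fetch. This is a sketch under assumptions: the bot names `AnyBot`, `GoodBot`, and `OtherBot` and the page paths are invented for the example, and other parsers may match user-agent names differently.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# The rule sets from the slides, written one directive per line.
DENY_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
DENY_DIRS = "User-agent: *\nDisallow: /a/\nDisallow: /b/\nDisallow: /c/\n"
DENY_BADBOT = "User-agent: BadBot\nDisallow: /\n"
GOOGLE_ONLY = "User-agent: Google\nDisallow:\n\nUser-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    print(can_fetch(DENY_ALL, "AnyBot", "/index.html"))
    print(can_fetch(ALLOW_ALL, "AnyBot", "/index.html"))
    print(can_fetch(DENY_DIRS, "AnyBot", "/a/page.html"))
    print(can_fetch(DENY_BADBOT, "BadBot", "/index.html"))
    print(can_fetch(GOOGLE_ONLY, "Google", "/index.html"))
```

Note that an empty `Disallow:` value means nothing is disallowed, which is why the second and last rule sets permit access.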

robots.txt generator: http://www.mcanerin.com/en/search-engine/robots-txt.asp

robots.txt is managed per package.

Could we tell the crawler, file by file, whether a page is public?

Package separation: blocking access by path or directory alone is difficult.

Meta Tag (HTML Meta Element)

What is a meta tag? A tag of the form <meta ...> in HTML and XHTML that provides structured metadata about a web page.

Image source: http://www.dzineblog360.com/2012/02/7-steps-of-on-page-of-seo/

Example:

<meta name="keywords" content="tags you want to be found by" />
<meta name="description" content="Write the description here." />

Robots and meta tags? Search engine registration and refusal

Example: <meta name="robots" content="noindex, nofollow" />

<meta name="robots" ... />

Writing rules: place the tag in the header (<head>)

<meta attribute="value" content="content" />

Writing rules: example 1

<meta name="title" content="robots.txt" />

<meta name="author" content="정윤성" />

<meta http-equiv="refresh" content="5;url=http://new.nhnnext.org" />

Writing rules: example 2

<meta name="keywords" content="ec2, robots.txt, meta tag" />

<meta name="description" content="A post about internet security" />

Robots meta tag: on to the main topic

<meta name="robots" ... />

Page allow, link allow: write two items in content

<meta name="robots" content="index, follow" />

Opposites: noindex, nofollow

Page allow, link allow: write two items in content

index / noindex: whether this page may be crawled

follow / nofollow: whether the links on this page may be crawled

Page allow, link allow: examples

<meta name="robots" content="index, follow" />
: Take this document's content, and take the content of linked documents too.

<meta name="robots" content="noindex, follow" />
: Do not take this document's content, but take the content of linked documents.

<meta name="robots" content="index, nofollow" />
: Take this document's content, and ignore linked documents.

<meta name="robots" content="noindex, nofollow" />
: Do not take this document's content, and ignore linked documents too.
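The four combinations above can be read from a page with Python's standard `html.parser`. This is a rough sketch of how a well-behaved crawler might interpret the tag, not the logic of any real search engine. The names `RobotsMetaParser` and `robots_policy` are made up for this example, and treating a missing tag as "index, follow" is an assumption based on the usual default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}


def robots_policy(html: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for the given HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
    print(robots_policy(page))
```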

Q. How practical is it?

- robots.txt is advisory (every well-built robot respects the file's directives)

- Almost all malicious bots ignore robots.txt

- It can even end up exposing weak points, because the file itself lists the paths the owner wants hidden.
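To see what a robots.txt file advertises to anyone who reads it, a site owner can list its Disallow paths. The sketch below is a simplified line-based reader written for this illustration, not a complete parser; the function name `disallowed_paths` and the `/admin/` and `/backup/` paths are hypothetical.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """List every non-empty Disallow value found in robots_txt."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop comments, then split "Field: value" at the first colon.
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file: the paths meant to stay hidden are readable by anyone.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/  # old database dumps
"""

if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print("publicly listed:", path)
```

Anything that must stay private needs real access control; listing it in robots.txt only asks polite robots to stay away.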

Conclusion: Is it practical?

Q. Does robots.txt have legal force?

- No

Q. Block IPs based on robots.txt?

- Lots of different IPs??

A. In the end, a firewall

Reference: Useful Homepages

IP verification: http://cqcounter.com/whois/ and http://www.projecthoneypot.org/search_ip.php

User-agent list check: http://www.botsvsbrowsers.com/

robots.txt generator: http://www.mcanerin.com/en/search-engine/robots-txt.asp

robots.txt official homepage: http://www.robotstxt.org/


