GDG İstanbul February Event - Presentation
Web Crawling & Web Scraping
cuneytykaya
cuneyt.yesilkaya
Cüneyt Yeşilkaya
Agenda
● Web Crawling
● Web Scraping
● Web Crawling Tools
● Demo (Crawler4j & Jsoup)
● Crawling - Where to Use
Web Crawling
Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.
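To make "methodical" concrete, here is a minimal sketch of a breadth-first crawl order. It is not taken from the talk: the class name CrawlOrderSketch is hypothetical, and the "web" is an in-memory map of page to outgoing links rather than real HTTP fetches, so the traversal logic can be read in isolation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: walks an in-memory link graph instead of the real web.
final class CrawlOrderSketch {

    // Returns the pages in the order a breadth-first crawl would visit them.
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>(); // remembers visit order and blocks repeats
        Deque<String> frontier = new ArrayDeque<>();  // pages discovered but not yet expanded
        visited.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            for (String next : links.getOrDefault(page, List.of())) {
                if (visited.add(next)) { // add() is false when the link was already seen
                    frontier.add(next);
                }
            }
        }
        return List.copyOf(visited);
    }

    public static void main(String[] args) {
        // Assumed toy site structure, for illustration only.
        Map<String, List<String>> links = Map.of(
            "/", List.of("/events", "/about"),
            "/events", List.of("/", "/events/february"),
            "/about", List.of("/"));
        System.out.println(crawl("/", links));
    }
}
```

A real crawler replaces the map lookup with an HTTP fetch plus link extraction; the frontier queue and the visited set stay the same.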
Web Scraping
Computer software technique of extracting information from websites.
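As a small illustration of extracting information from a page, the sketch below pulls the title out of an HTML string using only the Java standard library. The class TitleScraperSketch and its method are hypothetical names, and the regular expression is an assumption that only holds for simple, well-formed markup; a real HTML parser such as Jsoup is the sturdier choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based extraction, suitable only for simple markup.
final class TitleScraperSketch {

    // Captures the text between <title ...> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional when no title element is found.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```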
Web Crawling Tools
Selecting a Crawler
● Multi-Threaded Structure
● Max Pages to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
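Of the criteria above, redundant link control is the easiest to show in a few lines. The sketch below is one possible approach, not how any particular crawler implements it: URLs are reduced to a canonical form before being added to a "seen" set, so that trivially different spellings of the same address are fetched only once. The class UrlCanonicalizerSketch is a hypothetical name, and the code assumes absolute http or https URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: assumes absolute http/https URLs with a host component.
final class UrlCanonicalizerSketch {

    // Lower-cases scheme and host, drops default ports and fragments, resolves "." and "..".
    static String canonicalize(String url) {
        try {
            URI u = new URI(url).normalize();
            String scheme = u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost().toLowerCase(Locale.ROOT);
            int port = u.getPort();
            boolean defaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
            if (defaultPort) {
                port = -1; // -1 means "no explicit port" for java.net.URI
            }
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            // The fragment is passed as null because it never changes the fetched document.
            return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        // Both spellings map to the same canonical form, so the second add() returns false.
        System.out.println(seen.add(canonicalize("http://www.gdgistanbul.com/events")));
        System.out.println(seen.add(canonicalize("HTTP://WWW.GDGIstanbul.com:80/a/../events#top")));
    }
}
```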
Crawler4j
Yasser Ganjisaffar
Microsoft Bing & Microsoft Live Search
Demo - Crawler4j (1/3)
myCrawler.java myController.java
Demo - Crawler4j (2/3)
myCrawler.java
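import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    // Decides whether a discovered link is crawled: only URLs on the GDG Istanbul site.
    @Override
    public boolean shouldVisit(WebURL url) {
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    // Called once for every fetched page; page processing goes here.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
    }
}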
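A common refinement of the shouldVisit check is to skip links that point at static resources. The sketch below shows that idea as a plain predicate on a URL string so that it runs without crawler4j on the classpath. The class VisitFilterSketch and the extension list are assumptions chosen for illustration; inside a real crawler the same test would sit in shouldVisit and read the URL from WebURL.getURL().

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: the same decision shouldVisit makes, expressed on plain strings.
final class VisitFilterSketch {

    // Assumed list of extensions that are not worth fetching as HTML pages.
    private static final Pattern SKIPPED_EXTENSIONS =
        Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf)$");

    private static final String ALLOWED_PREFIX = "http://www.gdgistanbul.com";

    // True when the URL stays on the allowed site and does not end in a skipped extension.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return href.startsWith(ALLOWED_PREFIX) && !SKIPPED_EXTENSIONS.matcher(href).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://www.gdgistanbul.com/events"));
        System.out.println(shouldVisit("http://www.gdgistanbul.com/logo.png"));
    }
}
```

Note that this simple pattern only looks at the end of the URL, so an address such as logo.png?v=2 would still pass the filter.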
Demo - Crawler4j (3/3)
myController.java
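int numberOfCrawlers = 4; // number of concurrent crawler threads

CrawlConfig config = new CrawlConfig();
// crawler4j requires a storage folder for intermediate crawl data;
// this path is an assumption, the original slide did not set one.
config.setCrawlStorageFolder("/tmp/crawler4j");
config.setPolitenessDelay(250);   // milliseconds to wait between requests
config.setMaxPagesToFetch(100);   // stop after 100 pages

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.gdgistanbul.com");
controller.start(myCrawler.class, numberOfCrawlers); // blocks until the crawl finishes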
Demo - Jsoup (1/2)
Jsoup: a nice way to do HTML parsing in Java
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
Demo - Jsoup (2/2)
// Snippet 1: fetch a page and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// Snippet 2: parse HTML from a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// Snippet 3: traverse the DOM and read attributes and text
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// Snippet 4: select elements by attribute
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
Where to Use
● Search Engines (GoogleBot)
● Aggregators
  ○ Data aggregator
  ○ News aggregator
  ○ Review aggregator
  ○ Search aggregator
  ○ Social network aggregation
  ○ Video aggregator
● Kaarun Product Collector
www.kaarun.com
All Friends
Products for each Facebook Like
cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya
Thank you...