
    Python For Web Scraping - Week 1

    Andrew Hall

    October 6, 2010

    1 Environment

    Two easy ways to execute Python code:

    1. Open the Terminal (in OS X) or the command prompt (in Windows) and type python. This will bring you into the interactive Python environment; you can type in commands and press Enter to execute them.

    2. Put your code in a text file, save it with the extension .py, and execute it by typing python myfile.py in the Terminal.

    2 Hello, World

    print "hello, world"

    Or...

    string = "hello, world"

    print string

    Or...

    string = "hello, world"

    print string[0:5] + string[5:]

    Or...

    string = "hello, world"

    for char in string: print char


    Andrew Hall, Department of Government, Harvard University

    3 Variables

    Like most (all?) programming languages, Python lets (requires?) you store values as variables. In Python, unlike in many other languages you might have seen, you do not have to tell it what kind of variable you are creating:

    >>> num = 9

    >>> type(num)

    <type 'int'>

    >>> num

    9

    >>> num + 2

    11

    >>> num + "pwn"

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>

    TypeError: unsupported operand type(s) for +: 'int' and 'str'

    Variables can refer to each other or themselves, and this is important for writing programs. For example, you may want to set up a counter that keeps track of how many times you have carried out an operation (e.g. how many items you have added to a list).

    >>> count = 0

    >>> count = count + 1

    >>> count

    1

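    It's important to keep track of what type of variable Python has created for you, though. In the version of Python used here, dividing one integer by another drops the fractional part, while dividing floats keeps it: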

    >>> numerator = 3

    >>> denominator = 2

    >>> numerator/denominator

    1

    >>> numerator = float(3)

    >>> denominator = float(2)

    >>> numerator/denominator

    1.5
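Here is a minimal sketch of the same pitfall, written so that it should behave the same way on older and newer versions of Python (that portability is my assumption about what you might have installed; the variable names whole and exact are just for illustration). The // operator always drops the fractional part, while converting one operand with float() keeps it.

```python
numerator = 3
denominator = 2

# Floor division: the fractional part is discarded.
whole = numerator // denominator

# Converting one operand to a float is enough to keep the fraction.
exact = float(numerator) / denominator

print(whole)
print(exact)
```

If you only need a fractional result once, converting at the point of division is usually less error-prone than remembering to create every variable as a float.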

    4 Strings

    Working with strings is essential for web scraping. This is arguably the most important concept to understand.

    A string is a collection of characters, like:

    1. "string"

    Please send corrections to [email protected]


    2. "a string"

    3.

    4. "12345"

    As we saw before, we can save a string to a variable.

    >>> string = "length"

    >>> type(string)

    <type 'str'>

    >>> len(string)

    6

    By and large, the data you pull off the web will be formatted as strings. So you want to know how to manipulate, analyze, and store them.

    4.1 Slicing Strings

    >>> string = "hello, world"

    >>> print string[0]

    h

    >>> print string[11]

    d

    >>> print string[12]

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>

    IndexError: string index out of range

    >>> print string[-1]

    d

    A common challenge with web scraping is that you get a string containing a date and a value you want, like "Jul 4 2009, 20".

    1. How do you get the month?

    >>> line = "Jul 4 2009, 20"

    >>> line[0:3]

    'Jul'

    2. How do you get the day?

    >>> line[4]

    '4'

    3. How do you get the year?

    >>> line[6:10]

    '2009'


    4. How do you get the value?

    >>> line[-2:]

    '20'
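Fixed positions stop working as soon as the day has two digits (say, Jul 14 2009, 20). One possible alternative, sketched below, is to split the line on its separators instead of counting characters. This assumes the date and the value are always separated by a comma and a space, and that the parts of the date are separated by single spaces. split() is a standard string method, but the variable names are only illustrative.

```python
line = "Jul 14 2009, 20"

# Separate the date from the value at the comma.
date_part, value = line.split(", ")

# Break the date into its three pieces at the spaces.
month, day, year = date_part.split(" ")

print(month)
print(day)
print(year)
print(value)
```

Every piece is still a string at this point, so you would need int() before doing arithmetic with the day, year, or value.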

    Note, finally, that strings are immutable, meaning that you can't modify one you've created:

    >>> string = "test"

    >>> string[2] = "p"

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>

    TypeError: 'str' object does not support item assignment
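Since a string can't be changed in place, the usual workaround is to build a new string out of slices of the old one. A minimal sketch (the name fixed is just for illustration):

```python
string = "test"

# Everything before position 2, then the new character,
# then everything after position 2.
fixed = string[:2] + "p" + string[3:]

print(fixed)
print(string)  # the original string is untouched
```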

    4.2 String Methods

    Python provides a whole suite of really helpful methods for working with strings. You should check out the Python documentation online to find all of them. I've picked a couple to show here as a preview.

    4.2.1 Convert to all upper or lower case

    >>> string = "This is a String"

    >>> string.upper()

    'THIS IS A STRING'

    >>> string.lower()

    'this is a string'

    This is particularly useful if you're trying to match values. Say, for example, you have two different datasets with country names. In one dataset they might say "Afghanistan" and in the other they might have "AFGHANISTAN". If you try to match the two datasets as they are, you won't find the match; you need to convert to all upper case (or all lower case) before comparing.
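The country-name situation can be sketched in a few lines. The two variables stand in for values taken from two different datasets; the names are made up for the example.

```python
name_a = "Afghanistan"
name_b = "AFGHANISTAN"

# Comparing the raw strings fails because the case differs.
raw_match = (name_a == name_b)

# Converting both sides to lower case first makes the comparison succeed.
normalized_match = (name_a.lower() == name_b.lower())

print(raw_match)
print(normalized_match)
```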

    4.2.2 Strip out whitespace at beginning and end of strings

    >>> string = "   annoying spaces   "

    >>> string

    '   annoying spaces   '

    >>> string.strip()

    'annoying spaces'

    This comes up a lot. Oftentimes when you scrape data, it's weirdly formatted. Spaces at the beginning and end crop up, and they break comparisons and other operations.
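To see how stray whitespace breaks a comparison, here is a small sketch. The scraped value is invented for the example; note that strip() with no arguments also removes newlines and tabs, not just spaces.

```python
scraped = "  Hillary Clinton \n"
expected = "Hillary Clinton"

# The raw comparison fails because of the extra spaces and the newline.
matches_raw = (scraped == expected)

# Stripping first removes whitespace from both ends.
matches_clean = (scraped.strip() == expected)

print(matches_raw)
print(matches_clean)
```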


    4.2.3 Searching for a substring

    >>> string = "what to look for"

    >>> string.find("what")

    0

    >>> string.find("t")

    3

    >>> string.find("z")

    -1

    >>> string.find("or")

    14
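One common use of find() is to locate a separator and then slice around it. The sketch below assumes a scraped line with the form "label: value"; that format and the variable names are made up for illustration. Checking for -1 matters, because slicing with -1 would silently give you the wrong piece.

```python
line = "Yea votes: 57"

position = line.find(":")

if position != -1:
    # Text before the colon, and text after it with whitespace removed.
    label = line[:position]
    value = line[position + 1:].strip()
else:
    # No colon found: treat the whole line as the label.
    label = line
    value = ""

print(label)
print(value)
```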

    4.2.4 Dealing with numbers that have been read in as strings

    >>> number = "8"

    >>> number.isdigit()

    True

    >>> number = "eight"

    >>> number.isdigit()

    False

    >>> number + 2

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>

    TypeError: cannot concatenate 'str' and 'int' objects

    >>> int(number) + 2

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>

    ValueError: invalid literal for int() with base 10: 'eight'

    >>> number = "8"

    >>> int(number) + 2

    10
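Putting the two ideas together, you can check isdigit() before converting so that a bad value doesn't stop your program. This is only a sketch with made-up variable names. One caveat worth knowing: isdigit() returns False for strings such as "-3" or "3.5", so it only recognizes non-negative whole numbers.

```python
number = "8"

if number.isdigit():
    # Safe to convert: the string contains only digits.
    result = int(number) + 2
else:
    # Not a plain whole number; leave a placeholder instead of crashing.
    result = None

print(result)

# isdigit() is stricter than you might expect.
negative_is_digit = "-3".isdigit()
decimal_is_digit = "3.5".isdigit()
```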

    5 Lists

    Lists are a crucial data type. They let you store groups of values for later use.

    >>> list = [1,2,3,4]

    >>> list

    [1, 2, 3, 4]

    >>> list[0]

    1

    >>> list[3]

    4


    >>> list[4]

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>

    IndexError: list index out of range

    >>> list[-1]

    4

    Lists are a natural way to store the data you read in from the web. For example, suppose you are reading in the names of the senators that voted for a bill; for each senator on the web page you are reading, you add the senator to the list.

    >>> senators = []

    >>> senators.append("Daniel Webster")

    >>> senators.append("Hillary Clinton")

    >>> senators

    ['Daniel Webster', 'Hillary Clinton']

    You don't have to append things onto the end, though; you can insert them wherever you please:

    >>> senators.insert(0, "Tom Coburn")

    >>> senators

    ['Tom Coburn', 'Daniel Webster', 'Hillary Clinton']

    >>> senators.insert(1, "Joe Lieberman")

    >>> senators

    ['Tom Coburn', 'Joe Lieberman', 'Daniel Webster', 'Hillary Clinton']

    There are tons of other important things to do with lists, so I encourage you to check out the Python documentation.
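As a small preview of those other things, here are a few standard list operations applied to the senators list from above. Which ones you will actually need depends on your project; this is just a sampler, and the variable names are illustrative.

```python
senators = ["Tom Coburn", "Joe Lieberman", "Daniel Webster", "Hillary Clinton"]

# How many items are in the list?
how_many = len(senators)

# Is a particular value in the list?
has_webster = "Daniel Webster" in senators

# At what position does a value first appear?
position = senators.index("Joe Lieberman")

# A sorted copy; the original list keeps its order.
alphabetical = sorted(senators)

# Remove the first occurrence of a value (this changes the list itself).
senators.remove("Tom Coburn")

print(how_many)
print(alphabetical)
print(senators)
```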

    6 For Loops

    For Loops are a must for basically any programming, and certainly for web scraping. Suppose you have a set of URLs you want to scrape; you need a way to tell your program to iterate over each of the URLs. This is one of a million situations in which a For Loop gets the job done.

    The general idea with a For Loop in any language is to take a variable and a range of values, and set that variable to each of the values in the given range, one by one. So if I say (in pseudocode) "for i in (1,2,3,4)", I mean: first, set i = 1 and do something, then set i = 2 and do the same thing over again, and keep doing this until after you do it for i = 4.

    >>> for i in range(0,10): print i

    ...

    0

    1

    2


    3

    4

    5

    6

    7

    8

    9

    Python lets you abstract away your For Loops much more than most languages. For example, suppose you have a list of baseball teams:

    >>> teams = ["Red Sox", "Yankees", "Rays", "Blue Jays"]

    >>> for team in teams: print team

    ...

    Red Sox

    Yankees

    Rays

    Blue Jays

    Python magically knows (thanks to the in keyword) that you want it to loop through each of the elements in the list you give it.
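Coming back to the URL scenario from the start of this section, the loop might be sketched like this. The example.com addresses are placeholders, and nothing is actually downloaded here; a real scraper would fetch each page inside the loop. print is written with parentheses so the snippet should also run on newer versions of Python.

```python
urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3",
]

visited = []

for url in urls:
    # A real scraper would download and parse the page here;
    # this sketch only records which URL it would have visited.
    visited.append(url)
    print("Would scrape: " + url)
```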

    7 Logical Tests

    Just as important as For Loops are If Statements. We use If Statements to check values, e.g. to see if a certain senator is in a list of nay votes.

    >>> nays = ["Coburn", "Specter", "DeMint"]

    >>> if "Coburn" in nays: print "No!"

    ...

    No!
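An If Statement can also say what to do when the test fails, using else. The sketch below is my own illustration rather than part of the original example; the senator name and the vote labels are made up.

```python
nays = ["Coburn", "Specter", "DeMint"]
senator = "Webster"

if senator in nays:
    vote = "nay"
else:
    # Reached only when the senator is not in the list.
    vote = "not a recorded nay"

print(senator + ": " + vote)
```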

    We can check all sorts of things. Maybe we want to know whether the bill passed, given the list of nay voters.

    >>> if len(nays) > 50: print "Bill does not pass"

    ...

    >>> if len(nays)


    line! So we need to write a script. A script is just a list of commands for Python to execute, which you save in a text file. It's like giving the computer a list and letting it do the command-line entries for you.

    For example, take a look at this script. Note that lines that start with the pound sign are comment lines. These are not executed by Python, and are little notes we can leave for ourselves so we understand what we were thinking when we wrote the code.

    #FirstScript.py

    #Our first script!

    list = [3, 7, 3, 5, 1, 2]

    max = 0

    for num in list:

        if num > max: max = num

    print "Max is: " + str(max)

    To run this script, we open the Terminal (in OS X), make sure we are in the same directory as the script file (or write out the entire path of the file name), and go:

    Tue Oct 05 17:41:12 559 $ python FirstScript.py

    Max is: 7

    It's important to note Python's rules of syntax, which did not come up much when we were using the interactive interpreter, but are unavoidable when writing a script.

    Note, for example, the colon at the end of the For Loop line. This tells Python that it is going inside of a For Loop. Note also that the next line is indented exactly four spaces. Each line inside the For Loop must be indented by the same amount (four spaces is the convention), so that Python knows where the loop continues and where it ends (it ends when it gets to a line that is NOT indented).

    Likewise, there is a colon at the end of the If Statement. In this case, I have left the result of the If Statement on the same line as the If Statement, meaning I don't have to indent. This works as long as the result of the If Statement is only one line. If I needed more than one thing to execute, I'd need to put them on new lines and indent them four spaces.
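Here is a sketch of what the multi-line case looks like: a variant of the script in which the If Statement does two things, so its body goes on new lines indented four more spaces. The names values, largest, and updates are my own; I chose them partly so they don't hide Python's built-in list and max. print is written with parentheses so the snippet should also run on newer versions of Python.

```python
values = [3, 7, 3, 5, 1, 2]
largest = 0
updates = 0

for num in values:
    # Inside the loop: indented four spaces.
    if num > largest:
        # Inside the if: indented four more spaces.
        largest = num
        updates = updates + 1

# Back at the left margin, so this runs once, after the loop ends.
print("Max is: " + str(largest))
print("Number of updates: " + str(updates))
```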

    Fortunately, any good text editor will know you're writing a Python file, and thus will make the Tab key indent exactly four spaces.

    9 Next Topics

    1. Functions

    2. Modules

    3. Regular Expressions

    4. Other stuff?
