Getting the nth element using BeautifulSoup

From a large table I want to read rows 5, 10, 15, 20 ... using BeautifulSoup. How do I do this? Is findNextSibling and an incrementing counter the way to go?

5 Answers

You could also use findAll to get all the rows in a list and after that just use the slice syntax to access the elements that you need:

rows = soup.findAll('tr')[4::5]
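To see why the slice starts at index 4, here is a minimal sketch using a plain list as a stand-in for the result of findAll. The helper name every_nth and the sample row labels are made up for illustration.

```python
def every_nth(rows, n):
    """Return the n-th, 2n-th, 3n-th ... items of rows (1-based counting)."""
    # Index n - 1 is the n-th item; the step n then skips ahead to 2n, 3n, ...
    return rows[n - 1::n]

# Stand-in for the list returned by soup.findAll('tr')
rows = ["row%d" % i for i in range(1, 21)]
picked = every_nth(rows, 5)
print(picked)  # ['row5', 'row10', 'row15', 'row20']
```

Because the slice works on any list, the same helper applies unchanged to a list of tr tags.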

This can be done easily with select in BeautifulSoup if you know which row numbers you want. (Note: this requires bs4.)

row = 5
while True:
    # select() returns a list; it is empty once row exceeds the number of rows
    element = soup.select('tr:nth-of-type(' + str(row) + ')')
    if len(element) > 0:
        # element holds your desired row; do what you want with it
        row += 5
    else:
        break
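As a possible shortcut, the CSS functional notation nth-of-type(5n) matches every 5th row in a single select call, so the loop is not strictly needed. This is a sketch assuming bs4 with its bundled Soup Sieve selector support (bs4 4.7 or later) and a made-up ten-row table; note that nth-of-type counts rows per parent element, so a table split into thead and tbody restarts the count in each section.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table: ten rows, one cell each
html = "<table>" + "".join(
    "<tr><td>foo%d</td></tr>" % i for i in range(1, 11)
) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# 5n matches the 5th, 10th, 15th ... <tr> among its siblings
rows = soup.select("tr:nth-of-type(5n)")
texts = [row.get_text() for row in rows]
print(texts)  # ['foo5', 'foo10']
```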

As a general solution, you can convert the table to a nested list and iterate...

from bs4 import BeautifulSoup

def listify(table):
    """Convert an html table to a nested list"""
    result = []
    rows = table.find_all('tr')
    for row in rows:
        result.append([])
        cols = row.find_all('td')
        for col in cols:
            # Join every text node inside the cell into one string
            strings = [str(_string) for _string in col.find_all(string=True)]
            text = ''.join(strings)
            result[-1].append(text)
    return result

if __name__ == "__main__":
    # Build a small table with one column and ten rows, then parse into a list
    htstring = """<table>
<tr> <td>foo1</td> </tr>
<tr> <td>foo2</td> </tr>
<tr> <td>foo3</td> </tr>
<tr> <td>foo4</td> </tr>
<tr> <td>foo5</td> </tr>
<tr> <td>foo6</td> </tr>
<tr> <td>foo7</td> </tr>
<tr> <td>foo8</td> </tr>
<tr> <td>foo9</td> </tr>
<tr> <td>foo10</td> </tr>
</table>"""
    soup = BeautifulSoup(htstring, 'html.parser')
    for idx, ii in enumerate(listify(soup)):
        # Skip every row whose 1-based position is not a multiple of 5
        if (idx + 1) % 5 > 0:
            continue
        print(ii)

Running that...

[mpenning@Bucksnort ~]$ python testme.py
['foo5']
['foo10']
[mpenning@Bucksnort ~]$

Another option, if you prefer raw HTML...

from bs4 import BeautifulSoup

# Build a small table with one column and ten rows, then parse it into a list
htstring = """<table>
<tr> <td>foo1</td> </tr>
<tr> <td>foo2</td> </tr>
<tr> <td>foo3</td> </tr>
<tr> <td>foo4</td> </tr>
<tr> <td>foo5</td> </tr>
<tr> <td>foo6</td> </tr>
<tr> <td>foo7</td> </tr>
<tr> <td>foo8</td> </tr>
<tr> <td>foo9</td> </tr>
<tr> <td>foo10</td> </tr>
</table>"""
soup = BeautifulSoup(htstring, 'html.parser')
# Keep every 5th <tr> (1-based position) as a raw tag
result = [html_tr for idx, html_tr in enumerate(soup.find_all('tr'))
          if (idx + 1) % 5 == 0]
print(result)

Running that...

[mpenning@Bucksnort ~]$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
[mpenning@Bucksnort ~]$
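The (idx+1)%5 test in both snippets converts a 0-based index to a 1-based row number. A small variation, sketched here with a hypothetical helper name and a hand-built nested list in place of parsed rows, lets enumerate count from 1 so the modulo reads directly as "every 5th row". It is written as a generator, so it works on any iterable, not only lists.

```python
def every_nth_1based(items, n):
    """Yield the items whose 1-based position is a multiple of n."""
    for pos, item in enumerate(items, start=1):
        if pos % n == 0:
            yield item

# Stand-in for the nested list produced by listify()
table = [["foo%d" % i] for i in range(1, 11)]
picked = list(every_nth_1based(table, 5))
print(picked)  # [['foo5'], ['foo10']]
```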

Here's how you could scrape every 5th distribution link on this Wikipedia page with gazpacho:

from gazpacho import Soup
url = ""
soup = Soup.get(url)
a_tags = soup.find("a", {"href": "distribution"})
links = ["" + a.attrs["href"] for a in a_tags]
links[4::5] # start at 0,1,2,3,**4** and stride by 5
