From a large table I want to read rows 5, 10, 15, 20 ... using BeautifulSoup. How do I do this? Are findNextSibling and an incrementing counter the way to go?
5 Answers
You could also use findAll to get all the rows in a list and after that just use the slice syntax to access the elements that you need:
rows = soup.findAll('tr')[4::5]

This can be done easily with select in Beautiful Soup if you know the row numbers to be selected. (Note: this is in bs4.)
row = 5
while True:
    # select() returns a list; row must be converted to str before concatenating
    element = soup.select('tr:nth-of-type(' + str(row) + ')')
    if len(element) > 0:
        # element is your desired row element, do what you want with it
        row += 5
    else:
        break

As a general solution, you can convert the table to a nested list and iterate...
from bs4 import BeautifulSoup

def listify(table):
    """Convert an html table to a nested list"""
    result = []
    rows = table.find_all('tr')
    for row in rows:
        result.append([])
        cols = row.find_all('td')
        for col in cols:
            # Join every text node inside the cell into one string
            strings = col.find_all(string=True)
            text = ''.join(strings)
            result[-1].append(text)
    return result

if __name__ == "__main__":
    # Build a small table with one column and ten rows, then parse into a list
    htstring = """<table>
    <tr> <td>foo1</td> </tr>
    <tr> <td>foo2</td> </tr>
    <tr> <td>foo3</td> </tr>
    <tr> <td>foo4</td> </tr>
    <tr> <td>foo5</td> </tr>
    <tr> <td>foo6</td> </tr>
    <tr> <td>foo7</td> </tr>
    <tr> <td>foo8</td> </tr>
    <tr> <td>foo9</td> </tr>
    <tr> <td>foo10</td> </tr>
    </table>"""
    soup = BeautifulSoup(htstring, "html.parser")
    for idx, ii in enumerate(listify(soup)):
        # Skip every row whose 1-based position is not a multiple of 5
        if (idx + 1) % 5 > 0:
            continue
        print(ii)

Running that...
[mpenning@Bucksnort ~]$ python testme.py
['foo5']
['foo10']
[mpenning@Bucksnort ~]$

Another option, if you prefer raw html...
from bs4 import BeautifulSoup

# Build a small table with one column and ten rows, then parse it into a list
htstring = """<table>
<tr> <td>foo1</td> </tr>
<tr> <td>foo2</td> </tr>
<tr> <td>foo3</td> </tr>
<tr> <td>foo4</td> </tr>
<tr> <td>foo5</td> </tr>
<tr> <td>foo6</td> </tr>
<tr> <td>foo7</td> </tr>
<tr> <td>foo8</td> </tr>
<tr> <td>foo9</td> </tr>
<tr> <td>foo10</td> </tr>
</table>"""
soup = BeautifulSoup(htstring, "html.parser")
# Keep each <tr> whose 1-based position is a multiple of 5
result = [html_tr for idx, html_tr in enumerate(soup.find_all('tr'))
          if (idx + 1) % 5 == 0]
print(result)

Running that...
[mpenning@Bucksnort ~]$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
[mpenning@Bucksnort ~]$

Here's how you could scrape every 5th distribution link on this Wikipedia page with gazpacho:
from gazpacho import Soup
url = ""
soup = Soup.get(url)
a_tags = soup.find("a", {"href": "distribution"})
links = ["" + a.attrs["href"] for a in a_tags]
links[4::5] # start at 0,1,2,3,**4** and stride by 5
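
The `[4::5]` slice here is the same one used in the first answer, and it should pick exactly the elements that the `(idx + 1) % 5 == 0` filter picks. A minimal sketch, assuming a plain list stands in for the result of `findAll('tr')` (the `rows` list and its `row N` labels are made up for illustration):

```python
# Stand-in for the list returned by soup.findAll('tr'); the labels are hypothetical.
rows = ["row %d" % n for n in range(1, 21)]  # "row 1" .. "row 20"

# Slice: start at index 4 (the 5th row), then take every 5th element.
by_slice = rows[4::5]

# Filter: keep elements whose 1-based position is a multiple of 5.
by_filter = [row for idx, row in enumerate(rows) if (idx + 1) % 5 == 0]

print(by_slice)               # ['row 5', 'row 10', 'row 15', 'row 20']
print(by_slice == by_filter)  # True
```

The slice is the shorter option for a fixed stride; the filter form is easier to adapt when the condition is something other than "every Nth row".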