Channel: MobileRead Forums - Calibre

Capturing all articles, but all under the first section

Here is what the webpage looks like:

Code:

<div class='module'>

<h3>Section1</h3>
……
<li>
<h4>articles links and article titles</h4>
</li>
……

<h3>Section2</h3>

……
<li>
<h4>articles links and article titles</h4>
</li>
……

So I tried to parse it as:
Code:


        for section in soup.findAll('div', attrs={'class':['module']}):
            h3 = section.find('h3')
            section_title = self.tag_to_string(h3)
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('li'):
                h4 = post.findAll(['h4'])
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                url = a['href']

But it turned out that although all the articles were fetched correctly, they all ended up in the first section (that is, the other section names were not fetched). I am at a loss as to what to do: all section names are in <h3> tags, and unlike the webpages of the built-in recipes, <div class='module'> appears only once, before the first section, rather than before every section (which I think explains the failure). Can anyone help me out? Even a quick answer would be appreciated.
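One possible way around this, offered as a minimal sketch rather than a tested recipe: since there is only one <div class='module'>, walk the tags in document order and start a new section every time an <h3> appears, attaching each following article link to the most recent section. The sample HTML, the `SectionGrouper` class, and the `group_by_section` helper below are all hypothetical, and the sketch assumes each link sits inside an <h4> as in the outline above. It uses only the standard library so it can be run outside calibre.

```python
from html.parser import HTMLParser

# Hypothetical sample page mirroring the structure described in the question:
# one <div class='module'> containing several <h3> headings, each followed
# by <li> items that hold the article links.
SAMPLE = """
<div class='module'>
<h3>Section1</h3>
<ul>
<li><h4><a href="/a1">Article 1</a></h4></li>
<li><h4><a href="/a2">Article 2</a></h4></li>
</ul>
<h3>Section2</h3>
<ul>
<li><h4><a href="/b1">Article 3</a></h4></li>
</ul>
</div>
"""


class SectionGrouper(HTMLParser):
    """Walk tags in document order; every <h3> opens a new section."""

    def __init__(self):
        super().__init__()
        self.feeds = []      # [(section_title, [article dicts])]
        self.in_h3 = False
        self.in_h4 = False
        self.href = None     # href of the <a> currently open inside an <h4>
        self.text = []       # text collected for the current <h3> or <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3, self.text = True, []
        elif tag == 'h4':
            self.in_h4 = True
        elif tag == 'a' and self.in_h4:
            self.href = dict(attrs).get('href')
            self.text = []

    def handle_data(self, data):
        if self.in_h3 or self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3':
            # A finished heading starts a new, still empty section.
            self.in_h3 = False
            self.feeds.append((''.join(self.text).strip(), []))
        elif tag == 'h4':
            self.in_h4 = False
        elif tag == 'a' and self.href is not None:
            # Attach the article to the most recently seen section, if any.
            if self.feeds:
                title = ''.join(self.text).strip()
                self.feeds[-1][1].append({'title': title, 'url': self.href})
            self.href = None


def group_by_section(html):
    parser = SectionGrouper()
    parser.feed(html)
    parser.close()
    return parser.feeds


feeds = group_by_section(SAMPLE)
for section_title, articles in feeds:
    print(section_title, [a['url'] for a in articles])
```

Inside a recipe the same idea should carry over to the soup: iterate over the module's `findAll(['h3', 'li'])`, which yields matching tags in document order, check `tag.name`, open a new section on each 'h3', and append to the current article list on each 'li'. The list of (section title, article list) pairs has the shape that `parse_index` is expected to return.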
