Python网络爬虫技术与实战

3.5.2　Extracting Data

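Section 3.5.1 introduced the basic usage of the Beautiful Soup library. This section covers node selection, relational selection, and method selectors. Note that the HTML code used in the examples is saved as test.html in the same directory as the script.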

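1. Node selection

Accessing a node by its tag name, for example soup.title, selects that node element (the first one with that name), and its string attribute then returns the text inside the node. This kind of selection is very fast, and it is a good choice when the structure around a single node is clear.

(1) Getting the text value

The string attribute returns the text content of a node, as the following example shows.

[Example 3-40] Getting string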

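from bs4 import BeautifulSoup

# Parse test.html (saved in the same directory) with the lxml parser
with open("test.html") as f:
    soup = BeautifulSoup(f, "lxml")
print(soup.title)         # the title node itself
print(type(soup.title))   # the type of the selection result
print(soup.title.string)  # the text inside the title node
print(soup.p.string)      # the text inside the first p node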

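The first line of output is the selected title node: the title tag together with its text. The second line is its type, bs4.element.Tag, an important data structure in the Beautiful Soup library; every result produced by this kind of selection is a Tag. A Tag has a number of attributes, such as string, which returns the node's text content, so the remaining two lines are the text of the title node and of the first p node.

The output is as follows: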

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
The Dormouse's story
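
The behaviour of string depends on how many children a node has, which the example above does not show. The following is a minimal sketch under stated assumptions: the inline HTML and the variable names are illustrative and not part of the book's test.html, and html.parser is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML so the sketch runs without test.html
html = "<div><p><b>bold only</b></p><p>plain <i>and italic</i></p></div>"
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

# The first p has a single child (<b>) which has a single string,
# so .string passes that text up.
print(first.string)

# The second p has two children (a string and an <i> tag),
# so .string cannot pick one and returns None.
print(second.string)

# get_text() concatenates all the text inside the node instead.
print(second.get_text())
```

When string returns None for a node that visibly contains text, the node most likely has more than one child, and get_text() or the strings generator is the better tool.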

To get more than one string, iterate over the .strings generator, as in the following example:

for string in soup.strings:
    print(repr(string))

The output is as follows:

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'

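The strings may contain many spaces or blank lines. stripped_strings skips whitespace-only strings and strips leading and trailing whitespace from the rest: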

for string in soup.stripped_strings:
    print(repr(string))

The output is as follows:

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
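
The difference between the two generators can be checked on a small fragment. This is a sketch only: the HTML string is made up for illustration, and html.parser is chosen so that the whitespace in the input is kept exactly as written.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and line breaks between nodes
html = "<p>\n  Lacie\n  <b> and </b>\n  Tillie\n</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every string exactly as it appears in the document
raw = list(soup.p.strings)

# .stripped_strings trims each string and skips whitespace-only ones
clean = list(soup.p.stripped_strings)

print(raw)
print(clean)
print(" ".join(clean))
```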

(2) Getting the name

The name attribute returns a node's name. Using the same document as above, select the p node and read its name attribute:

print(soup.p.name)

The output is as follows:

p

(3) Getting attributes

HTML nodes carry attributes. In this document, for example, the first p node has class and name attributes, and attrs returns them:

print(soup.p.attrs)

The output is as follows:

{'class': ['title'], 'name': 'dromouse'}

The result is a dictionary, so soup.p.attrs['name'] returns the value of the name attribute.
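
Attribute lookup can be sketched as follows. The p tag is copied from the sample document into an inline string (an assumption made so the sketch is self-contained), and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

# The first p node of the sample document, inlined for the sketch
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.attrs["name"])  # dictionary lookup on attrs
print(p["name"])        # shorthand: index the Tag itself
print(p["class"])       # class can hold several values, so a list comes back
print(p.get("id"))      # get() returns None when the attribute is missing
```

Indexing a missing attribute with p["id"] raises KeyError, so get() is the safer choice when an attribute may be absent.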

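2. Relational selection

Beautiful Soup provides many attributes for traversing the document tree and working with a tag's child nodes. A tag may contain several strings or other tags, all of which are its child nodes. Often an element cannot be located directly; in that case it is easier to locate its parent first and reach the child from there, as shown in Figure 3-4.

Figure 3-4　Parent and child nodes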

(1) Child nodes and descendant nodes

A Tag's .contents attribute returns its child nodes as a list:

print(soup.head.contents)

The output is as follows:

[<title>The Dormouse's story</title>]

Because the result is a list, an individual element can be picked out by index:

print(soup.head.contents[0])

The output is as follows:

<title>The Dormouse's story</title>

The .children attribute gives access to the same nodes. Consider the following code:

print(soup.head.children)

The output is as follows:

<list_iterator object at 0x000001BE3F4CB7F0>

The result is not a list but an iterator (a list_iterator object in the output above), so the child nodes are obtained by looping over it:

for item in soup.body.children:
    print(item)

The output is as follows:

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names 
were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
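
Because line breaks between tags are child nodes too, the children of a node are often a mix of tags and strings. The sketch below, which assumes a made-up two-paragraph body and uses html.parser, shows one way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body with a line break between each pair of tags
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .contents is a list: three newline strings and two p tags
children = soup.body.contents
print(len(children))

# .children iterates over the same nodes; keep only the Tag objects
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]
print([t.name for t in tag_children])
```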

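(2) All descendant nodes

The .contents and .children attributes contain only a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants. As with .children, its contents are obtained by looping over it:

for item in soup.descendants:
    print(item)

The output shows that every node has been printed: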

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names 
were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names 
were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names 
were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie;
and they lived at the bottom of a well.
<p class="story">...</p>
...
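
The full listing is long, so the contrast between direct children and descendants is easier to see on a tiny fragment. This is an illustrative sketch; the HTML and names are assumptions, not taken from test.html.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment: div > p > (string, b > string)
html = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

direct = list(soup.div.children)        # only the p tag
all_nodes = list(soup.div.descendants)  # p, "Hello ", b, "world"

# Split the descendants into tags and strings, in document order
tags = [n.name for n in all_nodes if isinstance(n, Tag)]
strings = [str(n) for n in all_nodes if isinstance(n, NavigableString)]

print(len(direct), len(all_nodes))
print(tags)
print(strings)
```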

(3) Parent and ancestor nodes

To get the parent of a node element, use the .parent attribute. Using the same HTML as above:

p = soup.p
print(p.parent)

The output is as follows:

<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names 
were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>

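Here we selected the parent of the first p node. Its parent is clearly the body node, so the output is the body node and everything inside it.

Note that this returns only the direct parent of the p node; it does not continue outward to the parent's own ancestors. To get all ancestor nodes, use the .parents attribute:

content = soup.p
for parent in content.parents:
    print(parent)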
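
Printing each ancestor in full produces a lot of repeated markup, so printing only the names is often clearer. A minimal sketch, assuming a small made-up document parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: html > body > p > b
html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

b = soup.b
print(b.parent.name)  # the direct parent only

# .parents walks outward to the top; the last item is the
# BeautifulSoup object itself, whose name is "[document]"
ancestor_names = [parent.name for parent in b.parents]
print(ancestor_names)
```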

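(4) Sibling nodes and all sibling nodes

Sibling nodes are nodes at the same level as the current node. The .next_sibling attribute returns the next sibling of a node and .previous_sibling returns the previous one; if there is no such node, the result is None.

Note that in real documents the .next_sibling and .previous_sibling of a Tag are usually strings or whitespace, because whitespace and line breaks between tags also count as nodes. The result may therefore be blank or a line break.

print(soup.p.next_sibling)
print(soup.p.previous_sibling)
print(soup.p.next_sibling.next_sibling)

The .next_siblings and .previous_siblings attributes iterate over the siblings of the current node: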

for sibling in soup.a.next_siblings:
    print(repr(sibling))

The output is as follows:

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'

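Note that when a selection returns a single node, its string, attrs, and other attributes can be read directly to get its text and attributes. When it returns a generator over several nodes, convert it to a list, pick out an element, and then read string, attrs, and so on from that element.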
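
The generator-to-list step described above can be sketched as follows. The paragraph is a shortened, made-up version of the story paragraph, and filtering on Tag is one possible way (not the only one) to skip the string siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with three links separated by text
html = ('<p><a id="link1">Elsie</a>,\n'
        '<a id="link2">Lacie</a> and\n'
        '<a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))  # a string, not the next a tag
print(first.previous_sibling)    # nothing before the first link

# Convert the generator to a list, then keep only the tags
siblings = list(first.next_siblings)
tag_siblings = [s for s in siblings if isinstance(s, Tag)]

# Pick one element and read its text and attributes
print(tag_siblings[0].string, tag_siblings[0].attrs)
```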

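3. Method selectors

The selections described above all work through attributes. Beautiful Soup also provides methods for searching the document tree, such as find_all() and find(). Calling them with the appropriate arguments allows flexible searches.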

(1) find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

The find_all() method searches all tag descendants of the current tag and checks whether each one matches the filter conditions.

1) Searching by tag.

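import re

# Tags with exactly this name
soup.find_all('head')
# Tags whose name is in the list
soup.find_all(['head', 'body'])
# Dict form carried over from Beautiful Soup 3; the list form above is preferable
soup.find_all({'head': True, 'body': True})
# Tags whose name matches a regular expression (here, names starting with p)
soup.find_all(re.compile('^p'))
# Tags for which a function returns True; the function receives the tag
# (here, tags whose name is one character long)
soup.find_all(lambda tag: len(tag.name) == 1)
# All tags
soup.find_all(True)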

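2) Searching by attrs (attributes).

# Tags whose id attribute is xxx
soup.find_all(id='xxx')
# Tags whose id matches a regular expression and whose align attribute is xxx
soup.find_all(attrs={'id': re.compile('xxx'), 'align': 'xxx'})
# Tags that have an id attribute but no align attribute
soup.find_all(attrs={'id': True, 'align': None})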
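
The filters listed above can be tried out on a concrete document. The sketch below is illustrative only: the HTML is a trimmed copy of the sample document inlined as a string, the variable names are made up, and html.parser replaces lxml so that no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample document, inlined for the sketch
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                     # exact tag name
by_list = soup.find_all(["title", "b"])          # any name in the list
by_regex = soup.find_all(re.compile("^b"))       # names starting with b
by_function = soup.find_all(lambda tag: len(tag.name) == 1)
by_id = soup.find_all(id="link2")                # keyword argument
by_attrs = soup.find_all(attrs={"class": "sister",
                                "id": re.compile("^link")})
limited = soup.find_all("a", limit=2)            # stop after two matches

print([t["id"] for t in by_name])
print([t.name for t in by_regex])
print(len(by_function), len(by_attrs), len(limited))
```

Every call returns a list, which is empty when nothing matches, so the results can always be looped over without a None check.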

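(2) find(name, attrs, recursive, text, **kwargs)

The only difference from find_all() is the return value: find_all() returns a list of matches (a list with a single element when limit=1), whereas find() returns the first match directly, or None when nothing matches. Its parameters accept the same filters. Filtering on different parameters covers the following cases:

1) finding a tag, based on the name parameter;

2) finding text, based on the text parameter;

3) finding with a regular expression;

4) finding by tag attributes, based on the attrs parameter;

5) finding with a function.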
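
The five cases above can be sketched on a short fragment. The HTML and variable names are assumptions made for illustration. Note also that recent versions of Beautiful Soup name the text argument string; the sketch uses string, and the signature shown above uses the older name text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two links
html = ('<p class="story"><a class="sister" id="link1">Elsie</a> '
        '<a class="sister" id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")                         # 1) by tag name
by_text = soup.find(string="Lacie")              # 2) by text; returns the string node
by_regex = soup.find(string=re.compile("^El"))   # 3) by regular expression
by_attrs = soup.find(attrs={"id": "link2"})      # 4) by attributes
by_function = soup.find(                         # 5) by function
    lambda tag: tag.has_attr("id") and tag["id"].endswith("2"))
missing = soup.find("table")                     # no match: None

print(first_a["id"], by_text, by_regex)
print(by_attrs.string, by_function["id"], missing)
```

Because find() may return None, code that goes on to read an attribute of the result usually checks for None first.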

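(3) find_parents(), find_parent()

find_all() and find() search only the descendants of the current node: its children, grandchildren, and so on. find_parents() and find_parent() search the ancestors of the current node instead, and they are called in the same way as an ordinary tag search.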
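
A short sketch of searching upward, assuming a made-up fragment with two nested div tags; the id values are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: div#outer > div#inner > p > b
html = '<div id="outer"><div id="inner"><p><b>text</b></p></div></div>'
soup = BeautifulSoup(html, "html.parser")

b = soup.b

# find_parent() returns the nearest matching ancestor
nearest_div = b.find_parent("div")
print(nearest_div["id"])

# find_parents() returns all matching ancestors, nearest first
all_divs = b.find_parents("div")
print([d["id"] for d in all_divs])

# No matching ancestor: find_parent() returns None
print(b.find_parent("table"))
```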

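(4) find_next_siblings(), find_next_sibling()

These two methods use the .next_siblings attribute to iterate over the sibling tags parsed after the current tag. find_next_siblings() returns all following siblings that match, while find_next_sibling() returns only the first following sibling that matches.

(5) find_previous_siblings(), find_previous_sibling()

These two methods use the .previous_siblings attribute to iterate over the sibling tags parsed before the current tag. find_previous_siblings() returns all preceding siblings that match, and find_previous_sibling() returns the first preceding sibling that matches.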
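
Unlike the bare .next_sibling attribute, these methods apply a filter, so intervening strings and non-matching tags are skipped. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: three links with text and a b tag in between
html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
        '<b>x</b> <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

second = soup.find(id="link2")
third = soup.find(id="link3")

# First matching sibling after / before the current tag
next_a = second.find_next_sibling("a")
prev_a = second.find_previous_sibling("a")
print(next_a["id"], prev_a["id"])

# All matching siblings; the previous ones come back nearest first
after_first = [t["id"] for t in soup.a.find_next_siblings("a")]
before_third = [t["id"] for t in third.find_previous_siblings("a")]
print(after_first, before_third)
```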

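(6) find_all_next(), find_next()

These two methods use the .next_elements attribute to iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes, and find_next() returns the first matching node.

(7) find_all_previous(), find_previous()

These two methods use the .previous_elements attribute to iterate over the tags and strings that come before the current tag. find_all_previous() returns all matching nodes, and find_previous() returns the first matching node.

Note　The parameters of the methods above are used exactly as in find_all(), and the methods work on similar principles.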
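
These methods follow parse order rather than the sibling relationship, so they can reach nodes in other branches of the tree. The sketch below assumes a small made-up document; the id values are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a head and two paragraphs
html = ("<html><head><title>T</title></head>"
        "<body><p id='p1'><b>one</b></p><p id='p2'>two</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find(id="p1")

# Searching forward in parse order
next_p = first_p.find_next("p")
tags_after = [t.name for t in first_p.find_all_next(True)]
print(next_p["id"], tags_after)

# Searching backward reaches the title, which is not an ancestor or sibling
prev_title = first_p.find_previous("title")
head_tags = [t.name for t in first_p.find_all_previous(["head", "title"])]
print(prev_title.string, head_tags)
```

Note that find_all_next(True) includes the b tag inside the first paragraph: the elements after a tag in parse order start with its own descendants.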