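Posted October 15, 2022

1. Introduction to BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. Working with a parser of your choice, it provides idiomatic ways to navigate, search, and modify the parsed document.

BeautifulSoup does not parse with regular expressions. It sits on top of a parser such as Python's built-in html.parser, lxml, or html5lib, and offers powerful parsing features on top of it. Using it can make both data extraction and crawler development more efficient.

2. BeautifulSoup Overview

Building the document tree

BeautifulSoup parses a document into a tree, and that tree is built from four kinds of objects.

| Object | Description |
| --- | --- |
| Tag | A tag. Access: soup.tag. Attributes: tag.name (tag name), tag.attrs (dictionary of the tag's attributes) |
| NavigableString | The text inside a tag. Access: soup.tag.string |
| BeautifulSoup | The whole document; can be treated like a Tag. Attributes: soup.name (the special name '[document]'), soup.attrs |
| Comment | An HTML comment inside a tag; a special kind of NavigableString. Access: soup.tag.string |

# The 'lxml' parser used below requires the lxml package to be installed
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# 1. BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
print(type(soup))

# 2. Tag object
print(soup.head, '\n')
print(soup.head.name, '\n')
print(soup.head.attrs, '\n')
print(type(soup.head))

# 3. NavigableString object
print(soup.title.string, '\n')
print(type(soup.title.string))

# 4. Comment object (the first <a> contains only an HTML comment)
print(soup.a.string, '\n')
print(type(soup.a.string))

# 5. Pretty-print the soup object
print(soup.prettify())

Traversing the document tree

BeautifulSoup converts the document into a tree because a tree makes it easy to walk through the content and pull out what you need.

| Downward traversal | Description |
| --- | --- |
| tag.contents | The tag's direct children, as a list |
| tag.children | The tag's direct children, as an iterator for use in loops |
| tag.descendants | All of the tag's descendants, as a generator for use in loops |

| Upward traversal | Description |
| --- | --- |
| tag.parent | The tag's parent node |
| tag.parents | All of the tag's ancestors, as a generator for use in loops |

| Sideways traversal | Description |
| --- | --- |
| tag.next_sibling | The next sibling of the tag |
| tag.previous_sibling | The previous sibling of the tag |
| tag.next_siblings | All siblings that follow the tag |
| tag.previous_siblings | All siblings that precede the tag |

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once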
upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# 1. Traverse downward
print(soup.p.contents)
print(list(soup.p.children))
print(list(soup.p.descendants))

# 2. Traverse upward
print(soup.p.parent.name, '\n')
for i in soup.p.parents:
    print(i.name)

# 3. Traverse sideways
print('a_next:', soup.a.next_sibling)
for i in soup.a.next_siblings:
    print('a_nexts:', i)
print('a_previous:', soup.a.previous_sibling)
for i in soup.a.previous_siblings:
    print('a_previouss:', i)

Searching the document tree

BeautifulSoup provides many search methods that make it easy to get exactly the content we need.

| Search method | Description |
| --- | --- |
| soup.find_all() | Finds every tag that matches the filters; returns a list |
| soup.find() | Finds the first tag that matches the filters; returns a single Tag (not a string), or None if nothing matches |
| soup.tag.find_parents() | Searches all ancestors of the tag; returns a list |
| soup.tag.find_parent() | Searches for the closest matching ancestor (the parent when no filter is given); returns a single element or None |
| soup.tag.find_next_siblings() | Searches all siblings that follow the tag; returns a list |
| soup.tag.find_next_sibling() | Searches for the next sibling of the tag; returns a single element or None |
| soup.tag.find_previous_siblings() | Searches all siblings that precede the tag; returns a list |
| soup.tag.find_previous_sibling() | Searches for the previous sibling of the tag; returns a single element or None |

Note that class is a reserved keyword in Python, so matching on a tag's class attribute needs one of these two approaches:
- Pass a dictionary through the attrs argument
- Use BeautifulSoup's special keyword argument class_

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# 1. find_all()
print(soup.find_all('a'))  # search by tag name
print(soup.find_all('a', id='link1'))  # search by attribute value
print(soup.find_all('a', class_='sister'))  # search by class
# string= is the current name of this argument; older releases call it text=
print(soup.find_all(string=['Elsie', 'Lacie']))

# 2. find()
print(soup.find('a'))
print(soup.find(id='link2'))

# 3. Search upward
print(soup.p.find_parent().name)
for i in soup.title.find_parents():
    print(i.name)

# 4. Search sideways
print(soup.head.find_next_sibling().name)
for i in soup.head.find_next_siblings():
    print(i.name)
print(soup.title.find_previous_sibling())  # <title> has no previous sibling
for i in soup.title.find_previous_siblings():
    print(i.name)

CSS selectors

BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find the matching tags.

Common HTML tags:
HTML headings: <h1> </h1> through <h6> </h6>
HTML paragraph: <p> </p>
HTML link: <a href='https://www.baidu.com/'> this is a link </a>
HTML image: <img src='Ai-code.jpg' width='104' height='144' />
HTML table: <table> </table>
HTML list: <ul> </ul>
HTML block: <div> </div>

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')
print('Tag lookup:', soup.select('a'))
print('Attribute lookup:', soup.select('a[id="link1"]'))
print('Class lookup:', soup.select('.sister'))
print('id lookup:', soup.select('#link1'))
print('Combined lookup:', soup.select('p #link1'))

Image crawling example

import os

import requests
from bs4 import BeautifulSoup


def getUrl(url):
    """Fetch a page and return its HTML text, or an error message on failure."""
    try:
        read = requests.get(url, timeout=10)
        read.raise_for_status()
        read.encoding = read.apparent_encoding
        return read.text
    except requests.RequestException:
        return "Connection failed!"

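def getPic(html):
    """Download every image found inside the first <ul> of the page."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find('ul')
    if ul is None:
        # Happens when the page could not be fetched or has no <ul>
        print("No <ul> element found!")
        return
    root = "F:/Pic/"
    for img in ul.find_all('img'):
        img_url = img['src']
        print(img_url)
        path = root + img_url.split('/')[-1]
        print(path)
        try:
            os.makedirs(root, exist_ok=True)
            if not os.path.exists(path):
                read = requests.get(img_url)
                read.raise_for_status()
                # The with block closes the file; no explicit close() needed
                with open(path, "wb") as f:
                    f.write(read.content)
                print("File saved!")
            else:
                print("File already exists!")
        except (requests.RequestException, OSError):
            print("Failed to download file!")


if __name__ == '__main__':
    html = getUrl("https://findicons.com/search/nature")
    getPic(html)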
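
The crawler above passes each src value straight to requests.get, which only works when the page uses absolute image URLs. As a minimal offline sketch of one way to handle relative paths, the snippet below resolves every src against the page address with urllib.parse.urljoin. The helper name extract_image_urls, the sample markup, and the example.com addresses are assumptions made up for illustration; they are not part of the original post or of the target site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url):
    """Return an absolute URL for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")  # .get() returns None instead of raising KeyError
        if src:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            urls.append(urljoin(base_url, src))
    return urls


# Hypothetical markup standing in for a downloaded page
sample_html = """
<ul>
  <li><img src="/icons/leaf.png"></li>
  <li><img src="https://example.com/icons/tree.png"></li>
  <li><img alt="no source"></li>
</ul>
"""

image_urls = extract_image_urls(sample_html, "https://example.com/search/nature")
print(image_urls)
```

Because the helper only parses a string, it can be checked without any network access, and the returned list could then be fed to a download loop like the one in getPic.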