To scrape data with Python, we use the BeautifulSoup package.
!pip install beautifulsoup4 lxml
As a first step, we import the packages and load the HTML page that we want to scrape. Here I have used some static HTML content written specifically for this scraping exercise.
#imports
import requests
from bs4 import BeautifulSoup
#html
# html_string holds the static sample page (title "HTML Sample", heading "Doing Data Science with Python")
from IPython.display import display, HTML
# parse the HTML string with the lxml parser
ps=BeautifulSoup(html_string,"lxml")
The sample page renders as follows:
Doing Data Science with Python
This will help to perform various data science activities using Python
Modules
| Title | Duration in minutes |
|---|---|
| Getting Started | 20 |
| Setting Up Environment | 40 |
| Extracting Data | 30 |
| Exploring and Processing Data | 45 |
| Building Productive Model | 45 |
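The markup behind html_string is not shown in this post. Here is a minimal sketch of what it might look like, pieced together from the rendered sample and the selectors used in the code. The tag choices (h2 for "Modules", th for the header cells) and the layout are assumptions, not the original markup, and the built-in html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the sample page. The tag choices (h2, th)
# and the layout are assumptions, not the original markup.
html_string = """
<!DOCTYPE html>
<html>
<head><title>HTML Sample</title></head>
<body>
<h1>Doing Data Science with Python</h1>
<p id="description">This will help to perform various data science activities using Python</p>
<h2>Modules</h2>
<table id="module">
<tr><th>Title</th><th>Duration in minutes</th></tr>
<tr><td>Getting Started</td><td>20</td></tr>
<tr><td>Setting Up Environment</td><td>40</td></tr>
<tr><td>Extracting Data</td><td>30</td></tr>
<tr><td>Exploring and Processing Data</td><td>45</td></tr>
<tr><td>Building Productive Model</td><td>45</td></tr>
</table>
</body>
</html>
"""

# "html.parser" ships with Python; swap in "lxml" if that parser is installed.
ps = BeautifulSoup(html_string, "html.parser")

# Quick sanity check: the page title and the number of table rows.
print(ps.title.text)
print(len(ps.find_all(name="tr")))
```

With a string like this in place, the remaining snippets in the post can be run as written.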
To view the rendered HTML in a notebook, run the line below. Note that display and HTML come from IPython rather than BeautifulSoup:
display(HTML(html_string))
To print the HTML as parsed by BeautifulSoup, run the line below:
print(ps)
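Printing the soup object writes the markup out much as it was parsed. For a more readable view, prettify() returns the document as an indented string. A small sketch, using a shortened stand-in for html_string (an assumption) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the post's html_string (assumption).
html_string = "<html><body><h1>Doing Data Science with Python</h1></body></html>"
ps = BeautifulSoup(html_string, "html.parser")

# prettify() returns a str with the tags spread over indented lines.
pretty = ps.prettify()
print(pretty)
```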
Find and extract content by HTML tag, using the name parameter to select by tag name:
body=ps.find(name="body")
print(body)
Extract the text inside an HTML tag with the text attribute:
print(body.find(name="h1").text)
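One thing to watch for: find returns None when no matching tag exists, so chaining .text onto the result raises an AttributeError on pages that lack the tag. A hedged sketch of a guard follows. The helper name text_of is hypothetical, and the HTML is a shortened stand-in for the sample page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the post's html_string (assumption).
html_string = "<html><body><h1>  Doing Data Science with Python </h1></body></html>"
body = BeautifulSoup(html_string, "html.parser").find(name="body")


def text_of(parent, tag_name, default=""):
    """Return the stripped text of the first matching tag, or default if none exists."""
    tag = parent.find(name=tag_name)
    return tag.get_text(strip=True) if tag is not None else default


print(text_of(body, "h1"))
print(text_of(body, "h2", default="(no h2)"))
```

get_text(strip=True) also trims the surrounding whitespace, which .text leaves in place.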
# find the first <p> element; .text returns its content without the HTML tags
print(body.find(name="p").text)
# find all <p> elements
print(body.find_all(name="p"))
# loop through each <p> element and print only its contents
for p in body.find_all(name="p"):
    print(p.text)
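The result of find_all behaves like a list, so the loop can also be written as a list comprehension when the texts are needed as data rather than printed. A sketch with a stand-in page; the second paragraph is invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in page (assumption); the second <p> exists only for this example.
html_string = """
<html><body>
<p id="description">This will help to perform various data science activities using Python</p>
<p>Second paragraph, added only for this example</p>
</body></html>
"""
body = BeautifulSoup(html_string, "html.parser").find(name="body")

# Collect the text of every <p> element into a plain list of strings.
paragraphs = [p.get_text(strip=True) for p in body.find_all(name="p")]
print(paragraphs)

# limit caps the number of matches returned.
first_only = body.find_all(name="p", limit=1)
print(len(first_only))
```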
# narrow the selection by adding attributes
print(body.find(name="p", attrs={"id":"description"}))
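The same element can be reached in a few equivalent ways. A sketch on a shortened stand-in page (an assumption) that compares the attrs dictionary used in this post, the id keyword argument, and a CSS selector passed to select_one:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the post's html_string (assumption).
html_string = """
<html><body>
<p>Intro paragraph, added only for this example</p>
<p id="description">This will help to perform various data science activities using Python</p>
</body></html>
"""
body = BeautifulSoup(html_string, "html.parser").find(name="body")

by_attrs = body.find(name="p", attrs={"id": "description"})  # style used in the post
by_keyword = body.find("p", id="description")                # attribute as keyword argument
by_css = body.select_one("p#description")                    # CSS selector

for match in (by_attrs, by_keyword, by_css):
    print(match.get("id"), "->", match.text)
```

Which form to use is mostly a matter of taste. The attrs dictionary is the safest choice for attribute names that clash with Python keywords or find's own parameters, such as class or name.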
# get the data contained in the table
body=ps.find(name="body")
module_table=body.find(name="table",attrs={"id":"module"})
# skip the header row, then read the two cells of each data row
for row in module_table.find_all(name="tr")[1:]:
    cells=row.find_all(name="td")
    title=cells[0].text
    duration=cells[1].text
    print(title,duration)
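Printing is fine for a quick look, but collecting the rows into a structure makes them reusable. A sketch under a few assumptions: the header cells are th elements (a guess about the original markup), the table is shortened to two data rows, and the function name table_rows is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the sample page (assumption).
html_string = """
<html><body>
<table id="module">
<tr><th>Title</th><th>Duration in minutes</th></tr>
<tr><td>Getting Started</td><td>20</td></tr>
<tr><td>Setting Up Environment</td><td>40</td></tr>
</table>
</body></html>
"""
ps = BeautifulSoup(html_string, "html.parser")


def table_rows(soup, table_id):
    """Return the data rows of the table with the given id as a list of dicts."""
    table = soup.find(name="table", attrs={"id": table_id})
    if table is None:
        return []
    rows = []
    for row in table.find_all(name="tr")[1:]:  # skip the header row
        cells = row.find_all(name="td")
        if len(cells) < 2:  # ignore rows without both columns
            continue
        rows.append({
            "title": cells[0].get_text(strip=True),
            "duration": int(cells[1].get_text(strip=True)),
        })
    return rows


modules = table_rows(ps, "module")
total = sum(m["duration"] for m in modules)
print(modules)
print(total)
```

Converting the duration to int at extraction time means totals and averages can be computed directly afterwards.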