linuq-python-web-scraping-9.../Scraping Environnement Canada.ipynb

2017-12-17 06:02:37 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Web scraping with Python\n",
"## RoboBrowser\n",
"\n",
"- RoboBrowser is a Python library that simulates browsing behavior in a web browser.\n",
"- It can also extract specific pieces of information from a page and structure them.\n",
"- It is the most popular library that combines the advantages of `beautifulsoup` with the flexibility of `requests`.\n",
"- It is available for Python 2.7 and Python 3.6"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from robobrowser import RoboBrowser"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We create a browser object and navigate to a first URL"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"browser = RoboBrowser(history=True, parser=\"lxml\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"browser.open('https://meteo.gc.ca/')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieving a form by one of its properties (here, its unique identifier)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"form = browser.get_form(id=\"cityjump\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Displaying the form shows the available fields, along with any default values assigned to them"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<RoboForm city=, lang=f, unit=>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"form"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `value` property, we can fill in the form"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"form['city'].value = 'Québec'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then submit the form, as if we had clicked the submit button"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"browser.submit_form(form)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'dl.mrgn-bttm-0'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"dl.mrgn-bttm-0\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The browser now holds the resulting page. From this result, we can extract various elements using `CSS` selectors."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"location_xml = (\n",
"    browser.select('.col-sm-10')[0]\n",
"    .select('dl.mrgn-bttm-0')[0]\n",
"    .select('dd.mrgn-bttm-0')\n",
")"
]
},
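{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chained selection above can be tried offline. The following is a minimal sketch that uses `BeautifulSoup` directly (assuming `bs4` is installed) on a small hand-written HTML fragment. The fragment and its contents are hypothetical and only mimic the class names used above, and it assumes that RoboBrowser's `select` behaves like BeautifulSoup's."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical fragment that only mimics the class names used above\n",
"html = '''\n",
"<div class=\"col-sm-10\">\n",
"  <dl class=\"mrgn-bttm-0\">\n",
"    <dd class=\"mrgn-bttm-0\">Example Airport</dd>\n",
"    <dd class=\"mrgn-bttm-0\">Example observation date</dd>\n",
"  </dl>\n",
"</div>\n",
"'''\n",
"\n",
"soup = BeautifulSoup(html, 'html.parser')\n",
"# Each select() returns a list, so we index into it before narrowing further\n",
"items = soup.select('.col-sm-10')[0].select('dl.mrgn-bttm-0')[0].select('dd.mrgn-bttm-0')\n",
"[item.text for item in items]"
]
},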
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a pattern repeats, it is better to write a function that extracts the element according to parameters. The RoboBrowser library does not handle child nodes of the XML structure (the `:nth-child()` selector), but Python list indexing can be used to reproduce similar behavior."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"div.div-column:nth-child(4) > div:nth-child(2) > p:nth-child(2) > span:nth-child(1)\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def temperature_xml(jour):\n",
"    # jour is the zero-based index of the forecast day column\n",
"    return (\n",
"        browser.select('div.div-column')[jour]\n",
"        .select('div')[1]\n",
"        .select('p.mrgn-bttm-0')[0]\n",
"        .select('span')[0]\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The extracted data can be assembled into a Python dictionary and used in an application."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'city': 'Aéroport int. Lesage de Québec',\n",
" 'date': '13h00 HNE le samedi 9 décembre 2017',\n",
" 'temperature': ['-1°C\\n',\n",
" '-2°C\\n',\n",
" '-13°C\\n',\n",
" '-4°C\\n',\n",
" '-8°C\\n',\n",
" '-11°C\\n',\n",
" '-11°C\\n']}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{'city': location_xml[0].text,\n",
" 'date': location_xml[1].text,\n",
" 'temperature': [temperature_xml(i).text for i in range(7)]}"
]
},
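{
"cell_type": "markdown",
"metadata": {},
"source": [
"The temperature values above are still raw strings with a unit and a trailing newline. The following is a minimal sketch of converting them to integers with `re`. `parse_temperature` is a hypothetical helper name, and the pattern assumes the values always look like the ones shown in the output above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"def parse_temperature(raw):\n",
"    # Capture the signed integer that precedes the degree sign\n",
"    match = re.search(r'(-?\\d+)\\s*°', raw)\n",
"    # Return None when the text does not look like a temperature\n",
"    return int(match.group(1)) if match else None\n",
"\n",
"# Sample values copied from the output above\n",
"[parse_temperature(t) for t in ['-1°C\\n', '-13°C\\n']]"
]
},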
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}