261 lines
5.8 KiB
Text
261 lines
5.8 KiB
Text
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Web scraping avec Python\n",
|
|
"## RoboBrowser\n",
|
|
"\n",
|
|
"- RoboBrowser est une librairie python qui permet de simuler un comportement de navigation dans un navigateur web.\n",
|
|
"- Elle permet aussi d'extraire des éléments précis d'information d'une page et de les structurer. \n",
|
|
"- C'est la plus populaire qui combine les avantages de `beautifulsoup` avec la flexibilité de `requests`. \n",
|
|
"- Elle est disponible pour Python 2.7 et Python 3.6"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import re\n",
|
|
"from robobrowser import RoboBrowser"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"On crée un objet navigateur et on navigue vers une première URL"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"browser = RoboBrowser(history=True,parser=\"lxml\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"browser.open('https://meteo.gc.ca/')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Ouverture d'un formulaire à partir d'une de ses propriétés (ici, son identifiant unique)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"form = browser.get_form(id=\"cityjump\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"En affichant le formulaire, on peut voir les différents champs disponibles, ainsi que les valeurs par défauts qui sont attribuées, si applicable"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"<RoboForm city=, lang=f, unit=>"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"form"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"En utilisant la propriété `value`, on peut remplir le formulaire"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"form['city'].value = 'Québec'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"On envoie ensuite le formulaire, comme si on cliquait sur le bouton d'envoi"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"browser.submit_form(form)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'dl.mrgn-bttm-0'"
|
|
]
|
|
},
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"\"dl.mrgn-bttm-0\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"On obtient le résultat de la page suivante dans le navigatuer. À partir de ce résultat, on peut extraire différents éléments en utilisant le sélecteur `CSS`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"location_xml = browser. \\\n",
|
|
"select('.col-sm-10')[0]. \\\n",
|
|
"select('dl.mrgn-bttm-0')[0]. \\\n",
|
|
"select('dd.mrgn-bttm-0')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Lorsqu'un motif se répète, on préfère alors créer une fonction qui va extraire l'élément selon des paramètres. La librairie RoboBrowser ne gere pas les noeuds enfants de la structure XML (sélecteur `nth_child()`), mais on peut utiliser les listes de Python pour répliquer un comportement similaire."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"div.div-column:nth-child(4) > div:nth-child(2) > p:nth-child(2) > span:nth-child(1)\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def temperature_xml(jour): \n",
|
|
" return browser. \\\n",
|
|
" select('div.div-column')[jour]. \\\n",
|
|
" select('div')[1]. \\\n",
|
|
" select('p.mrgn-bttm-0')[0]. \\\n",
|
|
" select('span')[0]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"On peut assembler les données extraites dans un dictionnaire python et les utiliser dans son application."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"{'city': 'Aéroport int. Lesage de Québec',\n",
|
|
" 'date': '13h00 HNE le samedi 9 décembre 2017',\n",
|
|
" 'temperature': ['-1°C\\n',\n",
|
|
" '-2°C\\n',\n",
|
|
" '-13°C\\n',\n",
|
|
" '-4°C\\n',\n",
|
|
" '-8°C\\n',\n",
|
|
" '-11°C\\n',\n",
|
|
" '-11°C\\n']}"
|
|
]
|
|
},
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"{'city': location_xml[0].text, \n",
|
|
" 'date': location_xml[1].text,\n",
|
|
" 'temperature': [temperature_xml(i).text for i in range(0,7)]}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.3"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|