from bs4 import BeautifulSoup
import requests
import csv
Discovering the Value of My Home: A Personal Journey into Ensenada’s Real Estate Market
Ensenada, Baja California, Mexico, is an amazing port city that has always been a part of my life. Since Covid, I’ve built a home in the heart of it that is not just a place to live but a symbol of my dreams and aspirations. Yet without a national Multiple Listing Service (MLS) in Mexico, determining the true worth of my home has been a challenge. This project is my quest to uncover that value and to understand it in the context of the Ensenada real estate market.
Why now? Why this project? The answer lies in the desire to bridge the gap between the physical and the digital, to bring clarity to a real estate landscape that has, until now, been shrouded in mystery. By leveraging the power of web scraping, I aim to dive into the listings on point2homes.com, a decent-sized listing service, and build a model accurate enough to track values.
This journey is not just about numbers and statistics; it’s about uncovering the stories behind each property listing, the dreams and aspirations of those who call Ensenada home. And mostly it’s about understanding the market cycle and seeing how much cash I can make.
Join me as I embark on this exciting journey, as I dive into the depths of the Mexican real estate market, uncovering the hidden gems that lie within. Together, we will bring the real estate landscape of Ensenada into the light, making it accessible and understandable to all. And in doing so, we will uncover the true worth of my home, a testament to the beauty and potential of Ensenada.
First, let’s import the libraries we will be using.
We will use requests to call the website and bs4 (BeautifulSoup) to parse the returned HTML, which will be stored in memory. I just looked up my user agent and copied it from the internet.
Also, I will not bother to clean up and polish my notebooks. I really appreciate seeing the thought process and the code tests when I am learning. I will not be using this for any other purpose than to teach others. So if you are reading this, I hope you are learning something.
Data science is not a linear process; it’s messy and you will make mistakes.
URL = 'https://www.point2homes.com/MX/Real-Estate-Listings/Baja-California/Ensenada.html'
headers = {
    'Accept': 'application/json, text/javascript',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
}
# Get the page
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# let's look at the response from the website.
print(soup.prettify())
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>
Just a moment...
</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="noindex,nofollow" name="robots"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<style>
/* challenge-page styles omitted for brevity */
</style>
<meta content="375" http-equiv="refresh"/>
</head>
<body class="no-js">
<div class="main-wrapper" role="main">
<div class="main-content">
<noscript>
<div id="challenge-error-title">
<div class="h2">
<span id="challenge-error-text">
Enable JavaScript and cookies to continue
</span>
</div>
</div>
</noscript>
</div>
</div>
<script>
// challenge script omitted for brevity: it sets window._cf_chl_opt (cZone: "www.point2homes.com", cType: 'managed') with session tokens
// and loads /cdn-cgi/challenge-platform/h/g/orchestrate/chl_page/v1
</script>
</body>
</html>
Oh no, we were blocked immediately. The page says JavaScript and cookies must be enabled to view the website.
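Before moving on, it can help to detect this situation in code instead of reading the whole dump by eye. Below is a minimal sketch, assuming the block page always looks like the response printed above (a "Just a moment..." title and a `_cf_chl_opt` script variable). The function name `looks_like_challenge` is my own, and the check is only a heuristic, not an official way to identify the page.

```python
import re


def looks_like_challenge(html):
    """Heuristic: True if the HTML resembles the 'Just a moment...' block page.

    Assumption: the markers below are taken from the single response shown
    above and may change at any time.
    """
    title = re.search(r"<title>\s*(.*?)\s*</title>", html, flags=re.I | re.S)
    has_title = title is not None and title.group(1).lower().startswith("just a moment")
    return has_title or "_cf_chl_opt" in html


# Usage with the response fetched earlier:
# if looks_like_challenge(page.text):
#     print("Blocked by a JavaScript challenge; a real browser is needed.")
```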
Many sites use JavaScript-driven scrolling and bot detection. We will use Selenium to get around this. Selenium opens a real browser that runs the JavaScript and can scroll for us, so we can use it to scrape the rendered page. You may have to manually accept cookies and scroll down to load all of the listings. Once that is done we will take another look at the soup. If you haven’t figured it out by now, the soup is the parsed HTML response from the website: we make one call and then store the HTML in memory.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify()[:1000])
# Close the browser
driver.quit()
<html lang="en">
<head>
<script async="" src="https://matomo.mgmt.sharks.cloud/js/container_88C40TBZ.js">
</script>
<script async="" src="https://www.clarity.ms/s/0.7.26/clarity.js">
</script>
<script async="" src="https://www.clarity.ms/tag/kk321qb0ir?ref=gtm2">
</script>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-XBM0745T8C&l=dataLayer&cx=c" type="text/javascript">
</script>
<script src="https://securepubads.g.doubleclick.net/pagead/managed/js/gpt/m202403270101/pubads_impl_page_level_ads.js">
</script>
<title>
Ensenada, Baja California, Mexico Real Estate & Homes for Sale | Point2
</title>
<link crossorigin="" href="https://cdn.point2homes.com" rel="preconnect"/>
<link crossorigin="" href="https://mediavault.point2.com" rel="preconnect"/>
<link href="https://www.point2homes.com/MX/Real-Estate-Listings/Baja-California/Ensenada.html" hreflang="en" rel="alternate"/>
<link href="https://www.point2homes.com/ES/MX/Rea
Success! We have the HTML of the website. Now we will use Beautiful Soup to parse the HTML and extract the data we need.
For this step I usually open the inspector in Chrome developer tools and use the selector tool to find the class or id of the element I want to scrape. Then I use soup.select() or soup.find_all() to get the data I need.
# extract address-container
address_container = soup.find_all('div', class_='address-container')
# print 500 characters of the address_container
print(address_container[0].prettify()[:500])
<div class="address-container" data-address="Privada Toledo 1536, Ensenada, Baja California" title="Residential Property for sale in Privada Toledo 1536, Ensenada, Baja California">
Privada Toledo 1536, Ensenada, Baja California
</div>
The title and data-address attributes seem to hold some interesting data. The address is comma-separated, so we could split it into multiple columns later: the last two parts will be the city and state, and we can assume whatever remains is the listing’s own address line. Let’s keep going and see what we can find.
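The split described above can be sketched as follows. This is only an illustration under the stated assumption that the last two comma-separated parts are always city and state. The function name `split_address` and the keys `street`, `city` and `state` are hypothetical, not something the site defines.

```python
def split_address(data_address):
    """Split 'street..., city, state' into a dict.

    Assumption: the last two comma-separated parts are city and state,
    and everything before them belongs to the street/address line.
    """
    parts = [part.strip() for part in data_address.split(",")]
    if len(parts) < 3:
        # Not enough parts to tell city and state apart; keep the text whole.
        return {"street": data_address.strip(), "city": None, "state": None}
    *street_parts, city, state = parts
    return {"street": ", ".join(street_parts), "city": city, "state": state}


print(split_address("Privada Toledo 1536, Ensenada, Baja California"))
```

Addresses with extra commas in the street part still work, because only the final two parts are treated as city and state.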
# placeholder : extract title, address city, state from address_container
# inside div class characteristics-cnt extract beds, baths, sqft, price
characteristics_cnt = soup.find_all('div', class_='characteristics-cnt')
print(characteristics_cnt[0].prettify()[:500])
<div class="characteristics-cnt">
<ul>
<li class="ic-beds" data-label="Beds">
<strong>
3
</strong>
<span class="gray normal-lbl">
Beds
</span>
<span class="gray short-lbl">
Bds
</span>
</li>
<li class="ic-baths" data-label="Baths">
<strong>
2
</strong>
<span class="gray normal-lbl">
Baths
</span>
<span class="gray short-lbl">
Ba
</span>
</li>
<li class="ic-sqft" data-label="Sqft">
<strong>
1,829.86
</strong>
<spa
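The values inside the strong tags above come back as strings such as "3" and "1,829.86", so they will need converting before any modeling. Here is a minimal sketch of one way to do that. The helper name `to_number` is my own, and the price example in the comment is an assumed format, since no price text has been printed yet.

```python
import re


def to_number(text):
    """Return the first number found in a scraped string as a float, else None.

    Thousands separators (commas) are removed before conversion.
    """
    if text is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


print(to_number("1,829.86"))  # the Sqft value shown in the output above
# Assumed price format, for illustration only: to_number("$250,000 USD")
```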
Web scraping pro tip.
This is kind of a tedious process and I’m not very patient, so I took a sample of the HTML and gave it to claude.ai to identify the divs, classes, etc. that I need to extract the information. Claude is amazing for this sort of task; you just have to be a little sneaky about how you use it. I will use what it found to extract the data I need.
# Find all the listing elements
listings = soup.select('div.item-cnt')
# Iterate over each listing
for listing in listings:
    # Extract the address
    address_element = listing.select_one('div.address-container')
    address = address_element['data-address'] if address_element else None
    # Extract the number of bedrooms
    beds_element = listing.select_one('li.ic-beds strong')
    beds = beds_element.text if beds_element else None
    # Extract the number of bathrooms
    baths_element = listing.select_one('li.ic-baths strong')
    baths = baths_element.text if baths_element else None
    # Extract the square footage
    sqft_element = listing.select_one('li.ic-sqft strong')
    sqft = sqft_element.text if sqft_element else None
    # Extract the lot size
    lot_size_element = listing.select_one('li.ic-lotsize strong')
    lot_size = lot_size_element.text if lot_size_element else None
    # Extract the property type
    property_type_element = listing.select_one('li.ic-proptype')
    property_type = property_type_element.text if property_type_element else None
    # Extract the price
    price_element = listing.select_one('div.price')
    price = price_element.text if price_element else None
    # Extract the listing agent name and company
    agent_name_element = listing.select_one('div.agent-name a')
    agent_name = agent_name_element.text if agent_name_element else None
    agent_company_element = listing.select_one('div.agent-company')
    agent_company = agent_company_element.text if agent_company_element else None
# # Print the extracted information
# print(f'Address: {address}')
# print(f'Beds: {beds}')
# print(f'Baths: {baths}')
# print(f'Square Feet: {sqft}')
# print(f'Lot Size: {lot_size}')
# print(f'Property Type: {property_type}')
# print(f'Price: {price}')
# print(f'Listing Agent: {agent_name}, {agent_company}')
# print('---')
# count the number of listings in soup
print(f'Total number of listings: {len(listings)}')
Total number of listings: 24
Excellent!
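One thing to notice is that the loop above overwrites its variables on every pass, so nothing is kept once it finishes. Here is a rough sketch of how the same selectors could be collected into rows and saved with the csv module imported at the top. The helper names, the column names and the output filename are all my own placeholders, and it uses get_text(strip=True) instead of .text so the surrounding whitespace is dropped.

```python
import csv

# CSS selectors copied from the loop above; the keys are hypothetical column names.
TEXT_SELECTORS = {
    "beds": "li.ic-beds strong",
    "baths": "li.ic-baths strong",
    "sqft": "li.ic-sqft strong",
    "lot_size": "li.ic-lotsize strong",
    "property_type": "li.ic-proptype",
    "price": "div.price",
    "agent_name": "div.agent-name a",
    "agent_company": "div.agent-company",
}
FIELDNAMES = ["address", *TEXT_SELECTORS]


def listing_to_row(listing):
    """Turn one listing element (a BeautifulSoup Tag) into a flat dict."""
    address_element = listing.select_one("div.address-container")
    row = {
        "address": address_element.get("data-address")
        if address_element is not None
        else None
    }
    for field, selector in TEXT_SELECTORS.items():
        element = listing.select_one(selector)
        # Missing elements become None, which csv writes as an empty cell.
        row[field] = element.get_text(strip=True) if element is not None else None
    return row


def write_rows(rows, fileobj):
    """Write the rows as CSV with a header line to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Usage with the listings selected earlier (filename is a placeholder):
# rows = [listing_to_row(listing) for listing in listings]
# with open("ensenada_listings.csv", "w", newline="", encoding="utf-8") as f:
#     write_rows(rows, f)
```

Keeping the selectors in one dictionary means a changed class name on the site only needs to be fixed in one place.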
This is a decent start, but we are missing a lot of data. We will need to do two things. First, scroll down with Selenium so that all of the listings load, and grab that HTML. Then find the next-page button, click it, scroll again, and store the data for all of the pages.
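The scrolling half of that plan could look roughly like the sketch below. It assumes the page loads more listings as you scroll and that the page height stops growing once everything is loaded, which I have not verified for this site. The function name `scroll_to_bottom` and its parameters are hypothetical. It only relies on the driver's execute_script method, so it works with the Firefox driver created earlier. The next-page click is left out because the button's selector has not been identified yet.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed. max_rounds is a safety cap
    so an endlessly growing page cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded listings time to appear
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


# Usage, before reading driver.page_source:
# driver.get(URL)
# scroll_to_bottom(driver)
# soup = BeautifulSoup(driver.page_source, 'html.parser')
```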
# to be continued
[]