We would like to crawl a vehicle ads website. This website has 400,000 used vehicle ads and gets about 20,000 new ads per day. We want to set up something to crawl the 400K ads, then update the data every day with the 20K new ads.
The crawl would be run on our company’s infrastructure. We only need someone to write the script to execute and to find the best proxy method to crawl the website (our company would of course pay for the proxies).
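To give a rough idea of the volumes above, here is a small back-of-the-envelope calculation. It only counts one request per ad page, and the 7-day window for the initial 400K crawl is our own illustrative assumption, not a requirement.

```python
# Figures taken from the brief above.
ADS_BACKLOG = 400_000      # existing used vehicle ads
NEW_ADS_PER_DAY = 20_000   # new ads published per day
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(pages: int, days: float) -> float:
    """Average number of pages per second needed to fetch `pages` within `days`."""
    return pages / (days * SECONDS_PER_DAY)


# Daily update: 20K ad pages spread evenly over 24 hours.
daily_rate = required_rate(NEW_ADS_PER_DAY, 1)

# Initial crawl: 400K ad pages over a hypothetical 7-day window (assumption).
backlog_rate = required_rate(ADS_BACKLOG, 7)

print(f"daily update:  {daily_rate:.2f} pages/s")
print(f"initial crawl: {backlog_rate:.2f} pages/s")
```

Listing pages and the phone number request would add to these numbers, so treat them as a lower bound.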
We want to crawl all listing pages starting from this page: https://www.leboncoin.fr/voitures/offres/?f=p
(all pages with used car ads).
For each listing page, we also want to crawl the detail page of every vehicle ad on it (like https://www.leboncoin.fr/voitures/1138888682.htm?ca=12_s).
For each vehicle ad, we want to extract the following information:
– price (Prix)
– city (Ville)
– model (Modèle)
– year (Année-modèle)
– mileage (Kilométrage)
– description (Description)
– the phone number (hidden behind the button “Voir le numéro”)
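As an illustration of the output we have in mind, the fields above could be stored in a record like the one below. This is only a sketch: the names `VehicleAd` and `parse_int` are hypothetical, and the assumption that prices and mileages appear as strings such as "12 500 €" or "85 000 KM" should be checked against the real pages.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class VehicleAd:
    """One crawled ad. Every field except the URL may be missing on a given ad."""
    url: str
    price_eur: Optional[int] = None     # Prix
    city: Optional[str] = None          # Ville
    model: Optional[str] = None         # Modèle
    year: Optional[int] = None          # Année-modèle
    mileage_km: Optional[int] = None    # Kilométrage
    description: Optional[str] = None   # Description
    phone: Optional[str] = None         # behind "Voir le numéro"


def parse_int(text: str) -> Optional[int]:
    """Keep only the digits of a formatted number, e.g. '12 500 €' gives 12500.

    Assumes whole numbers: a decimal part would be merged into the result.
    """
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


ad = VehicleAd(
    url="https://www.leboncoin.fr/voitures/1138888682.htm?ca=12_s",
    price_eur=parse_int("12 500 €"),    # made-up sample value
    mileage_km=parse_int("85 000 KM"),  # made-up sample value
)
```

Keeping every field optional lets the script save an ad even when one value (for example the phone number) could not be retrieved.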
The tricky part is crawling the phone number. It is behind the “Voir le numéro” button, which sends a POST request. It looks like the POST request has to come from the same IP as the one used to access the ad page.
When we tried to do it ourselves, we were blocked by the website.
We tried without a proxy: we are blocked after 10 calls if we go too fast.
Of course you’ll have to use proxies. But when we used proxies, we were blocked as well (the POST to get the phone number returns “KO” instead of the number, and we do not know why). So you’ll have to find a way to do it 🙂