Override Robots.txt With wget
Tuesday, March 6, 2018, 3:04 PM
I find myself downloading lots of files from the web when converting sites into my company’s CMS. Whether the source is a static site or another CMS platform, doing this manually sucks. But thanks to wget’s recursive download feature, I can rip through a site and grab all of the images I need, even keeping the folder structure intact.
One thing I found out was that wget respects robots.txt files, so if the site you are trying to copy has one with the right settings, wget will only get what is allowed. This can be overridden with a few tweaks. I gladly used the workaround and decided to pass it along. See the instructions at the site below.
UPDATE:
Thanks to @jcheshire, who pointed out that wget actually has a setting to ignore robots.txt. The documentation isn't great, but it's a much simpler process.
wget -e robots=off --wait 1 http://your.site.here
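If you also want the recursive crawl described above, a fuller version of that command might look like the following (a sketch; the extra flags and the placeholder URL are my own additions, not from the original post):

wget -e robots=off --wait 1 --recursive --no-parent --page-requisites --convert-links http://your.site.here

Here --recursive follows links within the site, --no-parent keeps wget from climbing above the starting directory, --page-requisites pulls in the images and CSS each page needs, and --convert-links rewrites links so the local copy works offline.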