My Personal Findings

Override Robots.txt With wget

Tuesday, 15 Esfand 1396, 03:04 PM

I find myself downloading lots of files from the web when converting sites into my company’s CMS. Whether the source is a static site or another CMS platform, trying to do this manually sucks. But thanks to wget’s recursive download feature, I can rip through a site and grab all of the images I need, while even preserving the folder structure.
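
For example, a recursive pull limited to common image types might look like this (a sketch; http://your.site.here is a placeholder for the site being converted):

# Recurse through the whole site, stay below the starting URL,
# keep only image files, and let wget mirror the server's
# folder structure locally.
wget --recursive --no-parent --accept jpg,jpeg,png,gif http://your.site.here/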

One thing I found out was that wget respects robots.txt files, so if the site you are trying to copy has one with the right settings, wget will only get what is allowed. This can be overridden with a few tweaks. I gladly used the workaround and decided to pass it along. See the instructions below.
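
As an illustration, a hypothetical robots.txt like the one below would keep wget away from exactly the folders an image rip needs, since wget honors it by default:

# Hypothetical robots.txt on the source site:
User-agent: *
Disallow: /images/
Disallow: /assets/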

UPDATE:
Thanks to @jcheshire, who pointed out that wget actually has a built-in setting to ignore robots.txt. It is not the best-documented feature, but it makes for a much simpler process.


wget -e robots=off --wait 1 http://your.site.here
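
Here -e robots=off executes robots = off as if it were a line in your .wgetrc, so the robots.txt exclusions are ignored, and --wait 1 pauses a second between requests to go easy on the server. Combined with the recursive flags above, a full rip might look like this (again a sketch with a placeholder URL):

# Ignore robots.txt, recurse below the starting URL, throttle to
# one request per second, and keep only common image types.
wget -e robots=off --recursive --no-parent --wait 1 --accept jpg,jpeg,png,gif http://your.site.here/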
  • Ali Aminzadeh
