Bash script to download images from website
Image crawlers are very useful when we need to download all the images that appear in a web page. Instead of going through the HTML sources and picking all the images, we can use a script to parse the image files and download them automatically. Let’s see how to do it.
#!/bin/bash
#Description: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit -1
fi

for i in {1..4}
do
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
  esac
done

mkdir -p $directory;

baseurl=$(echo $url | egrep -o "https?://[a-z.]+")

curl -s $url | egrep -o "<img src=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/g' > /tmp/$$.list

sed -i "s|^/|$baseurl/|" /tmp/$$.list

cd $directory;

while read filename;
do
  curl -s -O "$filename"
done < /tmp/$$.list
An example usage is as follows:
$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images
How it works…
The image downloader script parses an HTML page, strips out everything except the <img> tags, extracts the URL from each src="URL" attribute, and downloads the images to the specified directory.
This script accepts a web page URL and the destination directory path as command-line arguments. The first part of the script is a tricky way to parse command-line arguments.
The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is three; if not, it prints a usage message and exits.
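For example, invoking the script with the wrong number of arguments simply prints the usage line and exits:

$ ./img_downloader.sh
Usage: ./img_downloader.sh URL -d DIRECTORY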
If there are three arguments, the script parses the URL and the destination directory. To do that, a tricky hack is used:
for i in {1..4}
do
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
  esac
done
The for loop iterates four times (there is no significance to the number four; it just runs the case statement enough times to consume all of the arguments).
The case statement evaluates the first argument ($1) and matches it against -d or anything else. This means the -d option can be placed anywhere on the command line:
$ ./img_downloader.sh -d DIR URL
Or:
$ ./img_downloader.sh URL -d DIR
shift moves the positional arguments to the left, so that after a shift $1 takes the value of $2, after another shift it takes the value of $3, and so on. This lets us evaluate every argument through $1 itself. When -d is matched (the -d) case), the next argument is obviously the value for the destination directory. *) is the default match; it matches anything other than -d. While iterating, the default match may see $1 as the URL or as "" (once the arguments run out), and we must keep the URL without letting "" overwrite it. That is the purpose of the url=${url:-$1} trick: it returns the current value of url if it is already set and non-empty, otherwise it assigns $1.
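The expansion is easy to verify interactively; it substitutes $1 only while url is still empty (the URL here is just an example):

$ set -- http://example.com
$ url=""
$ url=${url:-$1}; echo $url
http://example.com
$ set -- ""
$ url=${url:-$1}; echo $url
http://example.com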
egrep -o "<img src=[^>]*>" prints only the matching strings, which are the <img> tags including their attributes. [^>]* matches every character except the closing >, that is, <img src="image.jpg" ... >.
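To see what the pipeline extracts, you can feed it a sample tag (the tag below is made up):

$ echo '<img src="/images/logo.png" alt="logo">' | egrep -o "<img src=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/g'
/images/logo.png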
Can it be used to download all the executable files, or files with a specific extension?
Yes, it can, but you have to tweak the script to find the tags that contain files with the specific extension, just as here for images we parse the HTML for the <img src= tag.
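For example, a minimal sketch of the same idea for PDF files, assuming they are linked through plain <a href="..."> tags (the pattern and paths below are only illustrative):

#!/bin/bash
#Sketch: download all .pdf links from a page
url=$1
baseurl=$(echo $url | egrep -o "https?://[a-z.]+")
#extract href values that end in .pdf
curl -s $url | egrep -o '<a href="[^"]*\.pdf"' | sed 's/<a href="\([^"]*\)"/\1/' > /tmp/$$.list
#turn root-relative paths into absolute URLs
sed -i "s|^/|$baseurl/|" /tmp/$$.list
while read filename;
do
  curl -s -O "$filename"
done < /tmp/$$.list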
Great job! Well done. Elegant.
I was wondering though, is there a way to use it for multiple sites? In other words, if I fed the script a CSV (or some other sort of list) with n URLs, could it do this for n websites?
That would be awesome man!
Hi Razgorov,
It's easy to do: you can write a new wrapper script (masterimgdl.sh) that executes the image downloader script again and again until the list of URLs in your url_lists.txt file ends. So you need to do the following:
1. create a .txt file (url_lists.txt) with all the URLs you want to crawl
2. create a new bash script (masterimgdl.sh) with the following content:
#!/bin/bash
exec < url_lists.txt
while read line
do
  ./img_downloader.sh ${line} -d images
done

3. create a new bash script, name it img_downloader.sh, using the script/code given in the post
4. put all 3 files in a single folder and give chmod 0755 * for all files
5. run the master wrapper script you created in step 2 (./masterimgdl.sh)

HTH, Admin
Hi,
When I try this I get the following error:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 954 0 954 0 0 3538 0 --:--:-- --:--:-- --:--:-- 12230
100 360k 0 360k 0 0 182k 0 --:--:-- 0:00:01 --:--:-- 186k
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
GIF89a����!�,D;curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
curl: Remote file name has no length!
curl: try 'curl --help' or 'curl --manual' for more information
Any help on this as I require some images for a Computer Vision assignment
It seems you have some curl version issues; that's why it's throwing these errors. Try changing the download line to:
curl -O -J -L $url
(-O saves the file under its remote name, -J uses the filename the server suggests, and -L follows redirects.) I used the version below and it works fine:
$ curl -V
curl 7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
Protocols: tftp ftp telnet dict ldap http file https ftps
Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz
same problem here, curl could not resolve host…
It returns the following error; I also tried other URLs and it's the same:
sh-3.2# ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images
sh: ./img_downloader.sh: Permission denied
sh-3.2#
You have to give the script execute permission before running it:
chmod 0755 img_downloader.sh
./img_downloader.sh http://www.flickr.com/search/?q=linux -d images
Thanks a lot. Now when I’m trying to run the script, it returns the following result:
bash-3.2$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: –s
100 72 100 72 0 0 88 0 --:--:-- --:--:-- --:--:-- 96
sed: 1: “/tmp/906.list”: invalid command code 9
bash-3.2$ cat /tmp/906.list
bash-3.2$
replace this:

while read filename;
do
  curl -s -O "$filename"
done

with this:

wget -i /tmp/$$.list
Or just use something like this:
#!/bin/bash
#Description: Images downloader
#Filename: imgparser.sh
if [ $# -ne 1 ];
then
echo "Usage: $0 URL"
exit -1
fi
rm -rf images/;
mkdir images/;
mkdir -p ./tmp/; #the lists below are written under ./tmp
url=$1;
baseurl=$(echo $url | egrep -o "https?://[a-z.]+")
# baseurl=$1;
curl -s $url > ./tmp/list.txt;
cat ./tmp/list.txt | egrep -o "<img src=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/g' > ./tmp/list_parsed.txt
sed -i.bak "s|^[\/]*|$baseurl/|g" ./tmp/list_parsed.txt;
cd images/
wget -i ../tmp/list_parsed.txt
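This version takes only the URL (no -d option) and always saves into ./images; usage would be, for example:

$ ./imgparser.sh http://www.flickr.com/search/?q=linux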