java - Crawling an image's width and height from an amazon.com link with jsoup


I am trying to get the width and height of the image at the following example Amazon link:

http://images.amazon.com/images/p/0099441365.01.sclzzzzzzz.jpg

I am using jsoup with the following code:

    import java.io.*;
    import org.jsoup.*;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class crawler_main {

        /**
         * @param args
         */
        public static void main(String[] args) {
            // TODO Auto-generated method stub
            String filePath = "c:/imagelinks.txt";
            try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
                String line;
                String width;
                //String height;
                while ((line = br.readLine()) != null) {
                    // process the line.
                    System.out.println(line);
                    Document doc = Jsoup.connect(line).ignoreContentType(true).get();
                    //System.out.println(doc.toString());
                    Elements jpg = doc.getElementsByTag("img");
                    width = jpg.attr("width");
                    System.out.println(width);
                    //String title = doc.title();
                }
            }
            catch (FileNotFoundException ex) {
                System.out.println("File not found");
            }
            catch (IOException ex) {
                System.out.println("Unable to read line");
            }
            catch (Exception ex) {
                System.out.println("Exception occurred");
            }
        }
    }

The HTML is fetched, but when I extract the width attribute, it returns null. When I print the fetched HTML, it contains garbage characters (I am guessing that what I am calling garbage characters is the actual image data). For example:

I can't paste the document.toString() result into the editor. Help!

The problem is that you're fetching a JPG file, not HTML. The call to ignoreContentType(true) provides a clue, as its documentation states:

Ignore the document's Content-Type when parsing the response. By default this is false, and an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)

If you want to obtain the width of the actual JPG file, this answer may be of use:

    BufferedImage bimg = ImageIO.read(new File(filename));
    int width          = bimg.getWidth();
    int height         = bimg.getHeight();
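Since the links in the question point straight at image files, the same ImageIO approach can probably be applied to the downloaded bytes without going through jsoup at all. The following is only a minimal sketch under that assumption: the class name ImageSize and the helper readSize are made up for illustration, and it expects each command-line argument to be a direct image URL.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import javax.imageio.ImageIO;

// Hypothetical helper class, not part of the original question or answer.
public class ImageSize {

    /**
     * Decodes the image in the stream and returns {width, height},
     * or null if ImageIO has no reader for the data (e.g. it is not an image).
     */
    static int[] readSize(InputStream in) throws IOException {
        BufferedImage img = ImageIO.read(in);
        if (img == null) {
            return null;
        }
        return new int[] { img.getWidth(), img.getHeight() };
    }

    public static void main(String[] args) throws IOException {
        // Assumption: each argument is a direct image URL, like the one in the question.
        for (String link : args) {
            try (InputStream in = URI.create(link).toURL().openStream()) {
                int[] size = readSize(in);
                System.out.println(link + " -> "
                        + (size == null ? "not a readable image" : size[0] + " x " + size[1]));
            }
        }
    }
}
```

To fit the loop from the question, the body of main could be called once per line read from the links file. Note that ImageIO.read decodes the whole image, which is simple but may be heavier than necessary for very large files.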
