Java - jsoup: crawling image width and height from an amazon.com link
The following is an example Amazon link for which I am trying to crawl the image's width and height:
http://images.amazon.com/images/p/0099441365.01.sclzzzzzzz.jpg
I am using jsoup with the following code:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class crawler_main {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String filepath = "c:/imagelinks.txt";
        try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
            String line;
            String width;
            //String height;
            while ((line = br.readLine()) != null) {
                // process line.
                System.out.println(line);
                Document doc = Jsoup.connect(line).ignoreContentType(true).get();
                //System.out.println(doc.toString());
                Elements jpg = doc.getElementsByTag("img");
                width = jpg.attr("width");
                System.out.println(width);
                //String title = doc.title();
            }
        } catch (FileNotFoundException ex) {
            System.out.println("file not found");
        } catch (IOException ex) {
            System.out.println("unable to read line");
        } catch (Exception ex) {
            System.out.println("exception occurred");
        }
    }
}
When I extract the width attribute from the fetched HTML, it returns null. When I print the fetched HTML, it contains garbage characters (I am guessing that what I am calling garbage characters is the actual image data). For example:
I can't paste the document.toString() result into the editor. Help!
The problem is that you're fetching a JPG file, not HTML. The call to ignoreContentType(true) provides a clue, as its documentation states:
Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)
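To make the "JPG, not HTML" point concrete, here is a minimal sketch that checks whether a fetched body looks like an image before handing it to an HTML parser. It uses only the JDK's URLConnection.guessContentTypeFromStream, which inspects the leading bytes. The class name ContentSniffer and the method looksLikeImage are made up for illustration and are not part of jsoup.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup: sniffs the first bytes of a body.
class ContentSniffer {

    /** Returns true if the leading bytes look like a known image format. */
    static boolean looksLikeImage(byte[] body) throws IOException {
        // ByteArrayInputStream supports mark/reset, which the JDK sniffer requires.
        try (InputStream in = new ByteArrayInputStream(body)) {
            String type = URLConnection.guessContentTypeFromStream(in);
            return type != null && type.startsWith("image/");
        }
    }

    public static void main(String[] args) throws IOException {
        // First bytes of a typical JFIF-style JPEG file.
        byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0,
                            0, 16, 'J', 'F', 'I', 'F', 0};
        byte[] html = "<html><head></head></html>".getBytes(StandardCharsets.US_ASCII);

        System.out.println("JPEG bytes look like an image: " + looksLikeImage(jpegStart));
        System.out.println("HTML bytes look like an image: " + looksLikeImage(html));
    }
}
```

If a check like this reports an image, there are no img tags to select, so parsing the body as a Document cannot yield a width attribute.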
If you want to obtain the width of the actual JPG file, this answer may be of use:
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

BufferedImage bimg = ImageIO.read(new File(filename));
int width = bimg.getWidth();
int height = bimg.getHeight();
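Since the question starts from links rather than local files, the same idea can be sketched with ImageIO.read(URL), which reads the image straight from a URL. The class name ImageSize and the method sizeOf are illustrative names only. To stay runnable without network access, the demo writes a small PNG to a temporary file and reads it back through a file: URL; treat this as a sketch, not a tested crawler.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

// Hypothetical helper: returns the pixel dimensions of an image behind a URL.
class ImageSize {

    /** Returns {width, height} of the image at the given URL. */
    static int[] sizeOf(URL url) throws IOException {
        BufferedImage img = ImageIO.read(url);
        // ImageIO.read returns null when no registered reader understands the data.
        if (img == null) {
            throw new IOException("Not a readable image: " + url);
        }
        return new int[] { img.getWidth(), img.getHeight() };
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a 40x30 PNG written to a temporary file.
        File tmp = File.createTempFile("sample", ".png");
        tmp.deleteOnExit();
        ImageIO.write(new BufferedImage(40, 30, BufferedImage.TYPE_INT_RGB), "png", tmp);

        int[] size = sizeOf(tmp.toURI().toURL());
        System.out.println(size[0] + " x " + size[1]); // prints "40 x 30"
    }
}
```

In the loop from the question, each line read from the links file could be passed as sizeOf(new URL(line)) in place of the jsoup call. Note that this decodes the whole image just to learn its size.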