java - How to extract links from web content?
I have downloaded a web page and want to extract the links in the file. The links include both absolute and relative ones. For example, I have:
<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script> or
<a href="http://stackoverflow.com/" />
So after reading the file, what should I do?
This isn't complicated to do if you want to use Java's built-in regex system. The hard bit is finding the right regex to match URLs[1][2]. For the sake of this answer, I'm going to assume you've done that and stored the pattern with syntax along the lines of this:
Pattern url = Pattern.compile("your regex here");
and that you have a way of iterating through each line. You'll want to define an ArrayList<String>:
ArrayList<String> urlsFound = new ArrayList<>();
From there, you'll have a loop to iterate through the file (assuming each line is a <? extends CharSequence> named line), and inside it you'll put this:
Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) urlsFound.add(urlMatch.group());
What this does is create a Matcher for the line and the URL-matching Pattern from before. Then it loops until #find() returns false (i.e., there are no more matches) and adds each match (with #group()) to the list, urlsFound.
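Putting the pieces above together, a minimal sketch might look like the following. The class name LinkExtractor and the regex are assumptions made for illustration: the pattern only matches double-quoted href and src values and is nowhere near a complete URL matcher. It uses a lookbehind so that #group() returns the URL alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Illustrative pattern only (an assumption, not a full URL regex):
    // matches the value of a double-quoted href or src attribute.
    // The lookbehind keeps the attribute name and opening quote out of the match.
    private static final Pattern URL = Pattern.compile("(?<=(?:href|src)=\")[^\"]+");

    // Collects every match of URL across the given lines, in order of appearance.
    static List<String> extract(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group()); // group() is the whole match
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
            "<a href=\"http://stackoverflow.com/\" />");
        System.out.println(extract(lines));
    }
}
```

Both the relative and the absolute link from the question end up in the list. Relative links are returned exactly as written in the page; resolving them against the page's address is a separate step.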
At the end of the loop, urlsFound will contain all the URL matches on the page. Note that this can be quite memory-intensive if you've got a lot of text, as urlsFound will be quite big, and you'll be creating and ditching a lot of Matchers.
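If the Matcher churn turns out to matter, one possible mitigation is to create a single Matcher and point it at each new line with Matcher#reset(CharSequence). The sketch below does that, and also stores results in a LinkedHashSet so repeated links are kept only once. Both choices are suggestions rather than part of the approach above, and the class name and regex are again hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusedMatcherExtractor {
    // Same illustrative pattern as an assumption: double-quoted href/src values only.
    private static final Pattern URL = Pattern.compile("(?<=(?:href|src)=\")[^\"]+");

    static Set<String> extract(Iterable<? extends CharSequence> lines) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> urlsFound = new LinkedHashSet<>();
        // One Matcher for the whole run; reset(...) swaps in the next input.
        Matcher urlMatch = URL.matcher("");
        for (CharSequence line : lines) {
            urlMatch.reset(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group());
            }
        }
        return urlsFound;
    }
}
```

Whether this is worth doing depends on the input size; for a single downloaded page the simpler version is probably fine.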
1: I found a few sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.
2: You'll need to make sure the entire URL is captured in a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
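As a sketch of the tweak mentioned in note 2: if the regex has to match more than the URL itself (here the attribute name and the quotes), the URL can be wrapped in a capturing group and read with #group(1) instead of #group(). The pattern and the class name GroupedLinkExtractor are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupedLinkExtractor {
    // Hypothetical pattern: attribute name, optional whitespace around '=',
    // then a single- or double-quoted value. Group 1 is the value alone.
    private static final Pattern URL =
        Pattern.compile("(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']");

    static List<String> extract(CharSequence text) {
        List<String> urlsFound = new ArrayList<>();
        Matcher urlMatch = URL.matcher(text);
        while (urlMatch.find()) {
            urlsFound.add(urlMatch.group(1)); // group(1), not the whole match
        }
        return urlsFound;
    }
}
```

With this pattern, group() would return something like href = 'page.html', while group(1) returns only page.html.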