java - How to extract links from web content?
I have downloaded a web page and want to extract the links in the file. The links include both absolute and relative ones. For example, I have:
<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>
or
<a href="http://stackoverflow.com/" />
So after reading the file, what should I do?
This isn't complicated to do if you want to use Java's built-in regex system. The hard bit is finding the right regex to match URLs[1][2]. For the sake of this answer, I'm going to assume you've done that and stored the pattern with syntax along the lines of this:
Pattern url = Pattern.compile("your regex here");
You'll also need some way of iterating through each line, and you'll want to define an ArrayList<String> to hold the results:
ArrayList<String> urlsFound = new ArrayList<>();
From there, you'll have a loop that iterates through the file (assuming each line is a <? extends CharSequence> line), and inside it you'll put this:
Matcher urlMatch = url.matcher(line);
while (urlMatch.find())
    urlsFound.add(urlMatch.group());
What this does is create a Matcher for the line and the URL-matching Pattern from before. Then it loops until #find() returns false (i.e., there are no more matches) and adds each match (with #group()) to the list, urlsFound.
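Putting those pieces together, a minimal sketch might look like the following. The regex is only a placeholder I'm assuming for illustration (it matches absolute http/https URLs and nothing else), and the class and method names are made up; swap in whatever pattern fits your needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Placeholder pattern (an assumption, not a complete URL grammar):
    // "http://" or "https://" followed by anything up to whitespace,
    // a quote, or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Collects every match of URL found in the given lines.
    public static List<String> extractLinks(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            while (urlMatch.find()) {
                // group() with no argument returns the whole match
                urlsFound.add(urlMatch.group());
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<a href=\"http://stackoverflow.com/\" />",
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>");
        System.out.println(extractLinks(lines)); // prints [http://stackoverflow.com/]
    }
}
```

Note that with this placeholder pattern the relative src path is skipped, since it doesn't start with a scheme. Catching relative links needs a pattern that looks at the attribute rather than the URL shape.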
At the end of the loop, urlsFound will contain all the matches of URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will be quite big, and you'll be creating and ditching a lot of Matchers.
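If the Matcher churn bothers you, one option is to create a single Matcher and re-point it at each line with reset(CharSequence), reading the file line by line instead of loading it all at once. This is just a sketch under the same assumptions as before: the pattern is a placeholder and the class name is hypothetical. The result list will still grow with the number of links found.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingLinkExtractor {
    // Placeholder pattern (assumption): absolute http/https URLs only.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Reads the input one line at a time and reuses a single Matcher.
    public static List<String> extractLinks(BufferedReader reader) {
        List<String> urlsFound = new ArrayList<>();
        Matcher urlMatch = URL.matcher(""); // created once, reset per line
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                urlMatch.reset(line); // point the same Matcher at the new line
                while (urlMatch.find()) {
                    urlsFound.add(urlMatch.group());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urlsFound;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the downloaded file here.
        String page = "<a href=\"http://stackoverflow.com/\" />\n"
                    + "<p>see https://example.com/docs too</p>";
        try (BufferedReader reader = new BufferedReader(new StringReader(page))) {
            System.out.println(extractLinks(reader));
        }
    }
}
```

For a real file you'd pass in something like Files.newBufferedReader(path) instead of the StringReader.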
1: I found a few sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.
2: You'll need to make sure the entire URL is captured in a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
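As an illustration of the single-group idea, here's a sketch that wraps the URL in capture group 1 and reads it back with group(1) rather than group(). I'm assuming a pattern that looks for double-quoted href or src attributes, which happens to cover both examples in the question (absolute and relative); the class name is made up, and a regex like this will miss single-quoted or unquoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeLinkExtractor {
    // Assumed pattern: href="..." or src="..." in double quotes.
    // Group 1 is the attribute value, i.e. the link itself.
    private static final Pattern LINK_ATTR =
        Pattern.compile("(?:href|src)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(CharSequence html) {
        List<String> urlsFound = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // group(1) is only the captured URL, without the attribute name or quotes
            urlsFound.add(m.group(1));
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        String html =
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>\n"
          + "<a href=\"http://stackoverflow.com/\" />";
        // prints [/assets/jquery-1.8.0.min.js, http://stackoverflow.com/]
        System.out.println(extractLinks(html));
    }
}
```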