java - How to extract links from web content?

February 15, 2015

I have downloaded a web page and want to extract the links in the file. The links include both absolute and relative ones. For example, I have:

<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>

<a href="http://stackoverflow.com/" />

So after reading the file, what should I do?

This isn't complicated to do if you want to use Java's built-in regex system. The hard bit is finding the right regex to match URLs^[1][2]. For the sake of this answer, I'm going to assume you've done that and stored the pattern with syntax along the lines of this:

Pattern url = Pattern.compile("your regex here");

and that you have a way of iterating through each line. You'll want to define an ArrayList<String>:

ArrayList<String> urlsFound = new ArrayList<>();

From there, you'll have a loop that iterates through the file (assuming each line is a <? extends CharSequence> line), and inside it you'll put this:

Matcher urlMatch = url.matcher(line);
while (urlMatch.find())
    urlsFound.add(urlMatch.group());

What this does is create a Matcher for the line using the URL-matching Pattern from before. Then it loops until #find() returns false (i.e., there are no more matches) and adds each match (obtained with #group()) to the list, urlsFound.
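Put together, the steps above might look like the following minimal sketch. The class name LinkExtractor, the method extractUrls, and the deliberately simple https?:// pattern are assumptions for illustration only; that pattern finds absolute URLs and is not a robust URL regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative, not part of the original answer.
final class LinkExtractor {
    // Assumed placeholder pattern: "http://" or "https://" followed by any run of
    // characters that are not whitespace, quotes, or angle brackets.
    static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Collects every match of URL found in the given lines, in order of appearance.
    static List<String> extractUrls(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group());
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
            "<a href=\"http://stackoverflow.com/\" />");
        // Only the absolute link matches this placeholder pattern.
        System.out.println(extractUrls(lines)); // prints [http://stackoverflow.com/]
    }
}
```

Note that the relative src value is not picked up here, because the placeholder pattern only recognises strings that start with a scheme.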

At the end of the loop, urlsFound will contain all the URL matches on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will be quite big, and you'll be creating and ditching a lot of Matchers.
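If the memory cost matters, one possible variation (a sketch, not something the original answer prescribes) is to reuse a single Matcher via Matcher.reset(CharSequence) and hand each match to a callback instead of accumulating a list. The class StreamingLinkScanner, the method forEachUrl, and the sample pattern are hypothetical names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class illustrating Matcher reuse; not part of the original answer.
final class StreamingLinkScanner {
    // Reads the source line by line and passes every match to the sink as it is found.
    static void forEachUrl(Reader source, Pattern url, Consumer<String> sink) {
        // One Matcher for the whole input; reset() points it at each new line.
        Matcher urlMatch = url.matcher("");
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                urlMatch.reset(line);
                while (urlMatch.find()) {
                    sink.accept(urlMatch.group());
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed placeholder pattern, matching absolute http/https URLs only.
        Pattern url = Pattern.compile("https?://[^\\s\"'<>]+");
        String page = "<a href=\"http://stackoverflow.com/\" />\n<p>no link here</p>\n";
        forEachUrl(new StringReader(page), url, System.out::println);
    }
}
```

The caller decides what to do with each match (print it, write it to a file, add it to a set), so nothing forces the full result list to sit in memory.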

^{1: I found a few sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.}

^{2: You'll need to make sure the entire URL is captured in a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.}
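To illustrate the single-group point for the two example tags in the question, here is a rough sketch that captures the value of an href or src attribute in group 1, so both absolute and relative links come back. The class AttributeLinkExtractor and its regex are assumptions made for this example; regexes over HTML are fragile, and this one ignores unquoted attribute values and other edge cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the pattern below is an illustrative assumption, not a complete HTML parser.
final class AttributeLinkExtractor {
    // Group 1 holds everything between the quotes of an href=... or src=... attribute.
    static final Pattern LINK_ATTR = Pattern.compile(
        "(?:href|src)\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the captured attribute values in order of appearance.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the whole link sits in one capturing group
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>\n"
            + "<a href=\"http://stackoverflow.com/\" />";
        System.out.println(extractLinks(html));
        // prints [/assets/jquery-1.8.0.min.js, http://stackoverflow.com/]
    }
}
```

Using group(1) rather than group() is what strips the attribute name and quotes from each match.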
