java - How to extract links from web content?


I have downloaded a web page and want to extract the links in the file. The links include both absolute and relative ones. For example, I have:

<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script> 

or

<a href="http://stackoverflow.com/" /> 

So after reading the file, what should I do?

This isn't that complicated to do, if you want to use the builtin regex system of Java. The hard bit is finding the right regex to match URLs[1][2]. For the sake of the answer, I'm gonna assume you've done that, and stored that pattern with syntax along the lines of this:

Pattern url = Pattern.compile("your regex here");

And a way of iterating through each line. You'll want to define an ArrayList<String>:

ArrayList<String> urlsFound = new ArrayList<>();

From there, you'll have a loop to iterate through the file (assuming each line is a <? extends CharSequence> line), and inside you'll put this:

Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) {
    urlsFound.add(urlMatch.group()); // group() returns the text of the current match
}

What this does is create a Matcher for the line and the URL-matching pattern from before. Then, it loops until #find() returns false (i.e., there are no more matches) and adds each match (with #group()) to the list, urlsFound.
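To make the pieces above concrete, here is a minimal sketch that puts them together. The class name LinkExtractor and the pattern are assumptions for illustration only: the pattern uses a lookbehind so that the whole match is whatever sits inside href="..." or src="...", and it is nowhere near a complete URL regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Assumed pattern (illustration only): the lookbehind requires href=" or src="
    // right before the match, so the match itself is just the URL text.
    static final Pattern URL = Pattern.compile("(?<=(?:href|src)=\")[^\"]+");

    static List<String> extract(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group()); // whole match, i.e. group 0
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
            "<a href=\"http://stackoverflow.com/\" />");
        System.out.println(extract(lines));
    }
}
```

Run against the two example lines from the question, this should pick up both the relative and the absolute link, since the pattern only cares about the attribute, not the shape of the URL.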

At the end of your loop, urlsFound will contain all the matches of all URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will be quite big, and you'll be creating and ditching a lot of Matchers.
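If the Matcher churn bothers you, one option is to create a single Matcher and point it at each new line with #reset(CharSequence). The sketch below assumes the page is read through a BufferedReader; the class name StreamingLinkExtractor is made up for the example. It only cuts down on Matcher allocations, the result list still grows with the number of links.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamingLinkExtractor {
    static List<String> extract(BufferedReader reader, Pattern url) throws IOException {
        List<String> urlsFound = new ArrayList<>();
        Matcher urlMatch = url.matcher(""); // one Matcher, reused for every line
        String line;
        while ((line = reader.readLine()) != null) {
            urlMatch.reset(line); // re-target the existing Matcher at the new line
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group());
            }
        }
        return urlsFound;
    }
}
```

Reading line by line also means the whole file never has to sit in memory at once, only the current line and the matches collected so far.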

1: I found a few sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.

2: You'll need to make sure the entire URL is captured in a single group (#group() with no arguments returns the whole match), or this won't work at all. It can be tweaked to work if there are multiple parts, though.
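As a rough illustration of that tweak: if the regex matches more than the URL itself (say, the attribute name and the quotes), wrap the URL part in a capturing group and read it with #group(1) instead of #group(). The pattern and the class name GroupLinkExtractor below are assumptions for the sketch, not a vetted URL regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLinkExtractor {
    // Assumed pattern (illustration only): the whole match is href="..." or src="...",
    // and capturing group 1 holds just the URL between the quotes.
    static final Pattern ATTR_URL =
        Pattern.compile("(?:href|src)\\s*=\\s*\"([^\"]*)\"");

    static List<String> extract(CharSequence text) {
        List<String> urlsFound = new ArrayList<>();
        Matcher m = ATTR_URL.matcher(text);
        while (m.find()) {
            urlsFound.add(m.group(1)); // group 1 = the URL, not the whole attribute
        }
        return urlsFound;
    }
}
```

Using an explicit group keeps the surrounding context (attribute name, quotes, optional whitespace) in the regex without letting it leak into the results.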

