heard about it a lot, and I finally had the chance to use it on one of my projects. This is an introductory tutorial on the Jsoup HTML parser.
What is Jsoup?!
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
With Jsoup we are able to:
- Scrape and parse HTML from a URL, file, or string
- Find and extract data, using DOM traversal or CSS selectors
- Manipulate the HTML elements, attributes and text
- Clean user-submitted content against a safe whitelist, to prevent XSS attacks
- Output tidy HTML
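
To make the list above a bit more concrete, here is a minimal sketch that touches most of those points: parsing from a string, selecting with CSS selectors, changing attributes, cleaning untrusted input, and printing the result. The class name JsoupIntro, the sample markup, and the example.com base URL are all made up for illustration. It also assumes a recent jsoup version where the cleaning rules live in a class called Safelist (older versions call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

public class JsoupIntro {

    // Hypothetical sample markup, used instead of a live URL so the sketch runs offline.
    static final String HTML =
        "<html><head><title>Sample page</title></head><body>"
        + "<div id=\"content\">"
        + "<a href=\"/docs\">Docs</a> "
        + "<a href=\"https://example.com/blog\">Blog</a>"
        + "</div></body></html>";

    // Parse HTML from a string; the base URI (an assumption here) lets jsoup resolve relative links.
    static Document parse() {
        return Jsoup.parse(HTML, "https://example.com/");
    }

    // Find data with a CSS selector and return the first link as an absolute URL.
    static String firstLinkAbsoluteUrl(Document doc) {
        Element link = doc.select("#content a[href]").first();
        return link.absUrl("href");
    }

    // Clean untrusted markup against jsoup's built-in "basic" safelist.
    static String cleanUntrusted(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println("Title: " + doc.title());

        // Extract the text and absolute URL of every link inside #content.
        Elements links = doc.select("#content a[href]");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.absUrl("href"));
        }

        // Manipulate the document: add a class and an attribute to the container.
        doc.select("#content").first().addClass("highlight").attr("data-seen", "true");

        // Clean user-submitted content; the script element is not on the safelist.
        System.out.println(cleanUntrusted("<p>Hello<script>alert(1)</script></p>"));

        // Output the (tidied) body HTML.
        System.out.println(doc.body().html());
    }
}
```

To load a real page instead of a string, Jsoup.connect(url).get() returns the same kind of Document, so the rest of the code would stay as it is.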