I'm going to resurrect this thread.
I have a question: I'm using HtmlCleaner and following an approach like this:
Code:
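// needs: java.io.*, java.net.*, org.htmlcleaner.*
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
// open the connection and parse the response into a node tree
TagNode rootNode = null;
try {
    URL url = new URL(url_str); // url_str holds the address of the page to clean
    URLConnection conn = url.openConnection();
    rootNode = cleaner.clean(new InputStreamReader(conn.getInputStream()));
} catch (IOException e) {
    // covers a malformed URL as well as connection or read failures
    e.printStackTrace();
}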
I'd like to know whether there is a way to get the complete, cleaned-up HTML back out as text, a bit like what is shown on the home page of the HtmlCleaner project. Example:
Here is a typical example - improperly structured HTML containing unclosed tags and missing quotes:
HTML code:
<table id=table1 cellspacing=2px
<h1>CONTENT</h1>
<td><a href=index.html>1 -> Home Page</a>
<td><a href=intro.html>2 -> Introduction</a>
After putting it through HtmlCleaner, XML similar to the following is coming out:
HTML code:
<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head />
  <body>
    <h1>CONTENT</h1>
    <table id="table1" cellspacing="2px">
      <tbody>
        <tr>
          <td>
            <a href="index.html">1 -> Home Page</a>
          </td>
          <td>
            <a href="intro.html">2 -> Introduction</a>
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>
Is this possible? How can I do it?