com.mindprod.entities
Class StripEntities

java.lang.Object
  extended by com.mindprod.entities.StripEntities
Direct Known Subclasses:
StripFileEntities

public class StripEntities
extends java.lang.Object

Strips HTML entities such as &quot; from a string, replacing them with their Unicode equivalents.

Since:
2002-07-14
Version:
2.9 2010-01-29 export XHTML entities (currently same as HTML-4 entities).
Author:
Roedy Green, Canadian Mind Products
See Also:
InsertEntities, InsertFileEntities, StripEntities, StripFileEntities

Field Summary
static int LONGEST_ENTITY
          The longest an entity can be, at least in our tables, including the leading & and trailing ;.
static int SHORTEST_ENTITY
          The shortest an entity can be, 4 characters, at least in our tables, including the leading & and trailing ;.
static char UNICODE_NBSP_160_0x0a
          Unicode nbsp (no-break space) char, decimal 160, hex 0xa0.
 
Constructor Summary
StripEntities()
           
 
Method Summary
static char bareHTMLEntityToChar(java.lang.String bareEntity, char howToTranslateNbsp)
          Converts an entity to a single char.
static char entityToChar(java.lang.String entity)
          Deprecated. Replaced by bareHTMLEntityToChar(String, char).
static java.lang.String flattenHTML(java.lang.String text, char translateNbspTo)
          Strips tags and entities from HTML.
static java.lang.String flattenXML(java.lang.String text)
          Strips tags and entities from XML.
protected static char possBareHTMLEntityWithSemicolonToChar(java.lang.String possBareEntityWithSemicolon, char translateNbspTo)
          Checks a number of gauntlet conditions to ensure this is a valid entity.
static char possEntityToChar(java.lang.String possBareEntityWithSemicolon)
          Checks a number of gauntlet conditions to ensure this is a valid entity.
static java.lang.String stripEntities(java.lang.String text)
          Deprecated. Use stripHTMLEntities or stripXMLEntities.
static java.lang.String stripHTMLEntities(java.lang.String text, char translateNbspTo)
          Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static java.lang.String stripHTMLTags(java.lang.String html)
          Removes tags from HTML leaving just the raw text.
static java.lang.String stripNbsp(java.lang.String text)
          Deprecated. stripNbsp should no longer be necessary. stripHTMLEntities(String, char) now lets you specify directly the translation of nbsp you want.
static java.lang.String stripTags(java.lang.String html)
          Deprecated. Use stripHTMLTags or stripXMLTags instead.
static java.lang.String stripXMLEntities(java.lang.String text)
          Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static java.lang.String stripXMLTags(java.lang.String xml)
          Removes tags from XML leaving just the raw text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNICODE_NBSP_160_0x0a

public static final char UNICODE_NBSP_160_0x0a
Unicode nbsp (no-break space) char, decimal 160, hex 0xa0.

See Also:
Constant Field Values

LONGEST_ENTITY

public static final int LONGEST_ENTITY
The longest an entity can be, at least in our tables, including the leading & and trailing ;.


SHORTEST_ENTITY

public static final int SHORTEST_ENTITY
The shortest an entity can be, 4 characters, at least in our tables, including the leading & and trailing ;.

See Also:
Constant Field Values
Constructor Detail

StripEntities

public StripEntities()
Method Detail

bareHTMLEntityToChar

public static char bareHTMLEntityToChar(java.lang.String bareEntity,
                                        char howToTranslateNbsp)
Converts an entity to a single char.

Parameters:
bareEntity - String entity to convert. Must have the leading & and trailing ; stripped; may have the form #x12ff, #123, lt or nbsp. Works faster if the entity is in lower case.
howToTranslateNbsp - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
equivalent character. 0 if not recognised.
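
The contract above (bare entity in, single char out, 0 when unrecognised) can be illustrated with a minimal standalone sketch. This is not the library's implementation: the class name BareEntitySketch, the method bareEntityToChar and the four-entry name table are hypothetical, chosen only for illustration, and the real tables are far larger.

```java
import java.util.Map;

/** Hypothetical sketch only; not the com.mindprod.entities code. */
final class BareEntitySketch {
    // Tiny illustrative table (assumption); the real class knows many more names.
    private static final Map<String, Character> NAMED =
            Map.of("lt", '<', "gt", '>', "amp", '&', "quot", '"');

    /** Converts a bare entity (no leading & or trailing ;) to a char, or 0 if unrecognised. */
    static char bareEntityToChar(String bareEntity, char howToTranslateNbsp) {
        if (bareEntity == null || bareEntity.isEmpty()) {
            return 0;
        }
        if (bareEntity.charAt(0) == '#') {
            try {
                boolean hex = bareEntity.length() > 1
                        && (bareEntity.charAt(1) == 'x' || bareEntity.charAt(1) == 'X');
                int code = hex
                        ? Integer.parseInt(bareEntity.substring(2), 16)
                        : Integer.parseInt(bareEntity.substring(1), 10);
                // This sketch accepts only values that fit in one char.
                return (code > 0 && code <= 0xFFFF) ? (char) code : (char) 0;
            } catch (NumberFormatException e) {
                return 0; // malformed number, e.g. "#" or "#xzz"
            }
        }
        if (bareEntity.equals("nbsp")) {
            return howToTranslateNbsp; // caller decides: ' ' or (char) 160
        }
        Character c = NAMED.get(bareEntity);
        return c == null ? (char) 0 : c.charValue();
    }

    public static void main(String[] args) {
        System.out.println(bareEntityToChar("lt", ' '));          // prints <
        System.out.println((int) bareEntityToChar("#123", ' '));  // prints 123
        System.out.println((int) bareEntityToChar("bogus", ' ')); // prints 0
    }
}
```

Returning 0 rather than throwing keeps the caller's scanning loop simple: an unrecognised entity can be passed through as ordinary text.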

entityToChar

public static char entityToChar(java.lang.String entity)
Deprecated. Replaced by bareHTMLEntityToChar(String, char).

Converts an entity to a single char.

Parameters:
entity - String entity to convert. Must have the leading & and trailing ; stripped; may be a #x12ff or #123 style entity. Works faster if the entity is in lower case.
Returns:
equivalent character, or 0 if not recognised. &nbsp; -> (char) 160
See Also:
bareHTMLEntityToChar(String, char)

flattenHTML

public static java.lang.String flattenHTML(java.lang.String text,
                                           char translateNbspTo)
Strips tags and entities from HTML. Leaves \n \r unchanged.

Parameters:
text - to flatten
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
flattened text

flattenXML

public static java.lang.String flattenXML(java.lang.String text)
Strips tags and entities from XML.

Parameters:
text - to flatten
Returns:
flattened text

possEntityToChar

public static char possEntityToChar(java.lang.String possBareEntityWithSemicolon)
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts Entity to corresponding char.

Parameters:
possBareEntityWithSemicolon - string that may hold an entity. The leading & must be stripped; the string may optionally contain text past the ;.
Returns:
corresponding unicode character, or 0 if the entity is invalid. nbsp -> (char) 160

stripEntities

public static java.lang.String stripEntities(java.lang.String text)
Deprecated. Use stripHTMLEntities or stripXMLEntities.

Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.

Parameters:
text - raw text to be processed. Must not be null.
Returns:
translated text. It also handles HTML 4.0 entities such as &hearts;, &#123; and &#xffff;. &nbsp; -> 160. null input returns null.
See Also:
stripHTMLEntities(String, char)

stripHTMLEntities

public static java.lang.String stripHTMLEntities(java.lang.String text,
                                                 char translateNbspTo)
Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.

Parameters:
text - raw text to be processed. Must not be null.
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
translated text. It also handles HTML 4.0 entities such as &hearts;, &#123; and &#xffff;. null input returns null.
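
As a rough illustration of the behaviour described for stripHTMLEntities, the following standalone sketch scans for & ... ; runs and substitutes the decoded char, passing everything else through unchanged. The class StripEntitiesSketch and its methods strip and lookup are hypothetical names, and the handful of named entities is an assumption made for brevity; the library's own algorithm and tables may differ.

```java
/** Hypothetical sketch only; not the com.mindprod.entities code. */
final class StripEntitiesSketch {
    /** Decodes a bare entity body; returns 0 if this sketch does not recognise it. */
    static char lookup(String bare, char nbsp) {
        switch (bare) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            case "nbsp": return nbsp;
            default:     break;
        }
        if (bare.length() > 1 && bare.charAt(0) == '#') {
            try {
                boolean hex = bare.charAt(1) == 'x' || bare.charAt(1) == 'X';
                int code = hex
                        ? Integer.parseInt(bare.substring(2), 16)
                        : Integer.parseInt(bare.substring(1), 10);
                if (code > 0 && code <= 0xFFFF) {
                    return (char) code;
                }
            } catch (NumberFormatException e) {
                // fall through: not a valid numeric entity
            }
        }
        return 0;
    }

    /** Replaces recognised entities; unrecognised ones and ordinary text pass unchanged. */
    static String strip(String text, char translateNbspTo) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int semi = text.indexOf(';', i + 1);
                if (semi > i) {
                    char decoded = lookup(text.substring(i + 1, semi), translateNbspTo);
                    if (decoded != 0) {
                        sb.append(decoded);
                        i = semi + 1; // skip past the whole entity
                        continue;
                    }
                }
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a &lt; b &amp;&amp; c&nbsp;d", ' ')); // prints a < b && c d
    }
}
```

A production version would normally bound the search for the semicolon by the longest known entity length instead of scanning to the next ; anywhere in the text.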

stripHTMLTags

public static java.lang.String stripHTMLTags(java.lang.String html)
Removes tags from HTML leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed HTML: no > in comments, all <...> balanced. Also removes text between applet, style and script tag pairs. Leaves &nbsp; and other entities as is.

Parameters:
html - input HTML
Returns:
raw text, with whitespaces collapsed to a single space, trimmed.
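
The tag-stripping behaviour described above could be approximated with regular expressions, as in this standalone sketch. It is not the library's code: the class StripTagsSketch is a hypothetical name, and replacing each tag with a space before collapsing whitespace is an assumption of this sketch (so an inline tag in the middle of a word would split that word). Like the documented method, it presumes well-formed input.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch only; not the com.mindprod.entities code. */
final class StripTagsSketch {
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Whole applet/style/script elements, including the text between the tag pair.
    private static final Pattern BLOCK = Pattern.compile(
            "<(applet|style|script)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes comments, applet/style/script blocks and tags; entities are left as is. */
    static String stripTags(String html) {
        String s = COMMENT.matcher(html).replaceAll(" ");
        s = BLOCK.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Collapse runs of whitespace to a single space, then trim.
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b> &amp; more</p><!-- c --><script>var x=1;</script>";
        System.out.println(stripTags(html)); // prints Hello world &amp; more
    }
}
```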

stripNbsp

public static java.lang.String stripNbsp(java.lang.String text)
Deprecated. stripNbsp should no longer be necessary. stripHTMLEntities(String, char) now lets you specify directly the translation of nbsp you want.

Converts all 160-style spaces (the result of stripEntities on &nbsp;) to ordinary spaces.

Parameters:
text - Text to convert
Returns:
Text with 160-style spaces converted to ordinary spaces
See Also:
stripHTMLEntities(String, char)

stripTags

public static java.lang.String stripTags(java.lang.String html)
Deprecated. Use stripHTMLTags or stripXMLTags instead.

Removes tags from HTML leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed HTML: no > in comments, all <...> balanced. Also removes text between applet, style and script tag pairs. Leaves &nbsp; and other entities as is.

Parameters:
html - input HTML
Returns:
raw text, with whitespaces collapsed to a single space, trimmed.

stripXMLEntities

public static java.lang.String stripXMLEntities(java.lang.String text)
Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.

Parameters:
text - raw XML text to be processed. Must not be null.
Returns:
translated text. null input returns null.

stripXMLTags

public static java.lang.String stripXMLTags(java.lang.String xml)
Removes tags from XML leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed XML: no > in comments, all <...> balanced.

Parameters:
xml - input XML
Returns:
raw text, with whitespaces collapsed to a single space, trimmed.

possBareHTMLEntityWithSemicolonToChar

protected static char possBareHTMLEntityWithSemicolonToChar(java.lang.String possBareEntityWithSemicolon,
                                                            char translateNbspTo)
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts Entity to corresponding char.

Parameters:
possBareEntityWithSemicolon - string that may hold an entity. The leading & must be stripped; the string may optionally contain text past the ;.
translateNbspTo - char you would like nbsp translated to, usually ' ' or (char) 160 .
Returns:
corresponding unicode character, or 0 if the entity is invalid.
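
The "gauntlet" idea, rejecting a candidate cheaply before any table lookup, might look like the standalone sketch below. The class GauntletSketch, the bound MAX_BARE_LEN and the specific checks are all hypothetical; the actual conditions the library tests are not listed in this documentation, and this sketch handles only a few named entities.

```java
/** Hypothetical sketch only; not the com.mindprod.entities code. */
final class GauntletSketch {
    // Arbitrary bound for this sketch (assumption); not the library's LONGEST_ENTITY.
    private static final int MAX_BARE_LEN = 8;

    /**
     * Takes text with the leading & already stripped, possibly with trailing text
     * past the ;. Returns the decoded char, or 0 if any check fails.
     */
    static char possEntityToChar(String poss, char translateNbspTo) {
        if (poss == null) {
            return 0;
        }
        int semi = poss.indexOf(';');
        // Check 1: a semicolon must exist, with a non-empty, plausibly short body.
        if (semi < 1 || semi > MAX_BARE_LEN) {
            return 0;
        }
        String bare = poss.substring(0, semi);
        // Check 2: a named entity body contains only letters and digits.
        for (int i = 0; i < bare.length(); i++) {
            if (!Character.isLetterOrDigit(bare.charAt(i))) {
                return 0;
            }
        }
        // Only after the cheap checks pass do we consult the (tiny) table.
        switch (bare) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            case "nbsp": return translateNbspTo;
            default:     return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(possEntityToChar("lt; rest of text", ' '));    // prints <
        System.out.println((int) possEntityToChar("no semicolon", ' '));  // prints 0
    }
}
```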