com.mindprod.entities
Class DeEntifyStrings

java.lang.Object
  extended by com.mindprod.entities.DeEntifyStrings
Direct Known Subclasses:
DeEntify, Flatten

public class DeEntifyStrings
extends java.lang.Object

Strips HTML entities such as &quot; from a string, replacing them by their Unicode equivalents.

Since:
2002-07-14
Version:
3.0 2011-02-10 rename to deEntify, delete deprecated methods.
Author:
Roedy Green, Canadian Mind Products
See Also:
DeEntify, DeEntifyStrings, Entify, EntifyStrings, Flatten

Field Summary
static int LONGEST_ENTITY
          The longest an entity can be, at least in our tables, including the leading & and trailing ;.
static int SHORTEST_ENTITY
          The shortest an entity can be (4), at least in our tables, including the leading & and trailing ;.
static char UNICODE_NBSP_160_0x0a
          Unicode nbsp (non-breaking space) char, 160, 0xa0.
 
Constructor Summary
DeEntifyStrings()
           
 
Method Summary
static char bareHTMLEntityToChar(java.lang.String bareEntity, char howToTranslateNbsp)
          convert an entity to a single char.
static java.lang.String deEntifyHTML(java.lang.String text, char translateNbspTo)
          Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static java.lang.String deEntifyXML(java.lang.String text)
          Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static java.lang.String flattenHTML(java.lang.String text, char translateNbspTo)
          strips tags and entities from HTML.
static java.lang.String flattenXML(java.lang.String text)
          strips tags and entities from XML.
protected static char possBareHTMLEntityWithSemicolonToChar(java.lang.String possBareEntityWithSemicolon, char translateNbspTo)
          Checks a number of gauntlet conditions to ensure this is a valid entity.
static char possEntityToChar(java.lang.String possBareEntityWithSemicolon)
          Checks a number of gauntlet conditions to ensure this is a valid entity.
static java.lang.String stripHTMLTags(java.lang.String html)
          Removes tags from HTML leaving just the raw text.
static java.lang.String stripXMLTags(java.lang.String xml)
          Removes tags from XML leaving just the raw text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNICODE_NBSP_160_0x0a

public static final char UNICODE_NBSP_160_0x0a
Unicode nbsp (non-breaking space) char, 160, 0xa0.

See Also:
Constant Field Values

LONGEST_ENTITY

public static final int LONGEST_ENTITY
The longest an entity can be, at least in our tables, including the leading & and trailing ;.


SHORTEST_ENTITY

public static final int SHORTEST_ENTITY
The shortest an entity can be (4), at least in our tables, including the leading & and trailing ;.

See Also:
Constant Field Values
Constructor Detail

DeEntifyStrings

public DeEntifyStrings()
Method Detail

bareHTMLEntityToChar

public static char bareHTMLEntityToChar(java.lang.String bareEntity,
                                        char howToTranslateNbsp)
convert an entity to a single char.

Parameters:
bareEntity - String entity to convert. Must have the leading & and trailing ; stripped. May have the form #x12ff, #123, lt or nbsp. Works faster if the entity is in lower case.
howToTranslateNbsp - char you would like &nbsp; translated to, usually ' ' or (char) 160
Returns:
equivalent character. 0 if not recognised.
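
The conversion described above can be illustrated with a minimal sketch. It is not the com.mindprod.entities implementation: the class BareEntitySketch, its method name and its four-entry table are hypothetical, and the real tables cover far more named entities. It only shows the documented contract: hex (#x12ff), decimal (#123) and named (lt, nbsp) forms, with 0 returned when the entity is not recognised.

```java
import java.util.Map;

/** Hypothetical sketch, not the com.mindprod.entities implementation. */
final class BareEntitySketch {
    // Tiny illustrative table (assumption); the real tables are much larger.
    private static final Map<String, Character> NAMED = Map.of(
            "lt", '<', "gt", '>', "amp", '&', "quot", '"');

    /** Returns the decoded char, or 0 if the bare entity is not recognised. */
    static char bareEntityToChar(String bareEntity, char howToTranslateNbsp) {
        if (bareEntity == null || bareEntity.isEmpty()) {
            return (char) 0;
        }
        if (bareEntity.charAt(0) == '#') {
            try {
                boolean hex = bareEntity.length() > 1
                        && (bareEntity.charAt(1) == 'x' || bareEntity.charAt(1) == 'X');
                int code = hex
                        ? Integer.parseInt(bareEntity.substring(2), 16)
                        : Integer.parseInt(bareEntity.substring(1), 10);
                // Accept only values that fit in a single non-zero char.
                return (code > 0 && code <= Character.MAX_VALUE) ? (char) code : (char) 0;
            } catch (NumberFormatException e) {
                return (char) 0; // e.g. "#", "#x" or "#12z"
            }
        }
        if (bareEntity.equals("nbsp")) {
            return howToTranslateNbsp; // caller chooses ' ' or (char) 160
        }
        Character c = NAMED.get(bareEntity);
        return c == null ? (char) 0 : c.charValue();
    }

    public static void main(String[] args) {
        System.out.println(bareEntityToChar("lt", ' '));
        System.out.println((int) bareEntityToChar("nbsp", (char) 160));
        System.out.println((int) bareEntityToChar("bogus", ' '));
    }
}
```

Returning 0 rather than throwing lets a caller fall back to copying the original text through unchanged.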

deEntifyHTML

public static java.lang.String deEntifyHTML(java.lang.String text,
                                            char translateNbspTo)
Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.

Parameters:
text - raw text to be processed. Must not be null.
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
translated text. It also handles HTML 4.0 entities such as &hearts;, &#123; and &#xffff;. &nbsp; -> 160. null input returns null.
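
A scan-and-replace loop of the kind this method implies might look roughly like the sketch below. Everything in it is an assumption made for illustration: the class DeEntifyLoopSketch, the five-entity decode helper, and the choice to copy unrecognised &...; sequences through unchanged, which may differ from what the library does with stray entities.

```java
/** Hypothetical sketch of an entity-decoding scan; not the library's code. */
final class DeEntifyLoopSketch {
    /** Decodes a handful of bare entities; returns 0 when unknown. */
    static char decode(String bare, char nbspTo) {
        switch (bare) {
            case "quot": return '"';
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "nbsp": return nbspTo;
            default:     return (char) 0;
        }
    }

    static String deEntify(String text, char nbspTo) {
        if (text == null) {
            return null; // null in, null out
        }
        if (text.indexOf('&') < 0) {
            return text; // fast path: ordinary text passes unchanged
        }
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Look for the terminating ; only when standing on an &.
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            char decoded = (semi > i + 1)
                    ? decode(text.substring(i + 1, semi), nbspTo)
                    : (char) 0;
            if (decoded != 0) {
                sb.append(decoded);
                i = semi + 1; // skip past the whole entity
            } else {
                sb.append(c); // not an entity we know: copy as is
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(deEntify("a &lt; b &amp;&amp; c&nbsp;d", ' '));
    }
}
```

Each entity is decoded exactly once in a single left-to-right pass, so &amp;lt; yields the text &lt; rather than <.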

deEntifyXML

public static java.lang.String deEntifyXML(java.lang.String text)
Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.

Parameters:
text - raw XML text to be processed. Must not be null.
Returns:
translated text. null input returns null.

flattenHTML

public static java.lang.String flattenHTML(java.lang.String text,
                                           char translateNbspTo)
strips tags and entities from HTML. Leaves \n \r unchanged.

Parameters:
text - to flatten
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
Returns:
flattened text
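
Flattening combines tag stripping with entity decoding, and the order of the two steps matters. The sketch below assumes tags are removed first and entities decoded second; the source does not state the order the library uses, and the class FlattenSketch, its regex and its short entity list are hypothetical.

```java
/** Hypothetical sketch of flattening; not the library's implementation. */
final class FlattenSketch {
    static String flatten(String html, char nbspTo) {
        // 1. Remove tags while entities are still encoded, so an encoded
        //    &lt; cannot be mistaken for the start of a tag.
        String text = html.replaceAll("<[^>]*>", "");
        // 2. Decode a few entities; &amp; goes last to avoid double decoding.
        return text.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", String.valueOf(nbspTo))
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(flatten("1 &lt; 2 <i>and</i> 3 &gt; 2", ' '));
    }
}
```

Decoding first would turn the text into 1 < 2 <i>and</i> 3 > 2, and a naive tag stripper would then swallow everything from the first < to the next >. Note that this sketch does no whitespace collapsing, so \n and \r pass through unchanged.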

flattenXML

public static java.lang.String flattenXML(java.lang.String text)
strips tags and entities from XML.

Parameters:
text - to flatten
Returns:
flattened text

possEntityToChar

public static char possEntityToChar(java.lang.String possBareEntityWithSemicolon)
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts Entity to corresponding char.

Parameters:
possBareEntityWithSemicolon - string that may hold an entity. Lead & must be stripped, but may optionally contain text past the ;
Returns:
corresponding unicode character, or 0 if the entity is invalid. nbsp -> (char) 160
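
The gauntlet itself is not spelled out here, so the following is only a guess at the kind of checks involved: a terminating ; must exist, the length must fall between SHORTEST_ENTITY and LONGEST_ENTITY, and the characters before the ; must be plausible. The class GauntletSketch is hypothetical, and the value 10 for LONGEST_ENTITY is an assumed placeholder, not the library's constant.

```java
/** Hypothetical sketch of gauntlet-style validation; not the library's code. */
final class GauntletSketch {
    static final int SHORTEST_ENTITY = 4;  // e.g. &lt; counting the & and ;
    static final int LONGEST_ENTITY = 10;  // assumed placeholder value

    /**
     * @param poss text following a lead &, possibly with extra text past the ;
     * @return the bare entity without the ;, or null if a check fails
     */
    static String extractBareEntity(String poss) {
        if (poss == null) {
            return null;
        }
        int semi = poss.indexOf(';');
        // The bare length excludes the & and the ; hence the - 2.
        // A missing ; gives -1, which also fails the lower bound.
        if (semi < SHORTEST_ENTITY - 2 || semi > LONGEST_ENTITY - 2) {
            return null;
        }
        for (int i = 0; i < semi; i++) {
            char c = poss.charAt(i);
            // Letters and digits anywhere; # only as the first character.
            boolean ok = Character.isLetterOrDigit(c) || (i == 0 && c == '#');
            if (!ok) {
                return null;
            }
        }
        return poss.substring(0, semi);
    }

    public static void main(String[] args) {
        System.out.println(extractBareEntity("lt; rest of text"));
        System.out.println(extractBareEntity("no semicolon here"));
    }
}
```

A candidate that survives these checks would then be handed to a lookup such as bareHTMLEntityToChar, which can still return 0 if the name is not in the tables.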

stripHTMLTags

public static java.lang.String stripHTMLTags(java.lang.String html)
Removes tags from HTML leaving just the raw text. Leaves &nbsp; and other entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments and the text between applet, style and script tag pairs. Presumes perfectly formed HTML, no > in comments, all <...> balanced.

Parameters:
html - input HTML
Returns:
raw text, with runs of whitespace collapsed to a single space, trimmed.
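
The behaviour described for stripHTMLTags can be approximated with a few regular expressions. This is a sketch under the same presumption of well-formed input, not the library's code: the class StripTagsSketch and its patterns are hypothetical, and replacing each tag with a space (rather than nothing) is an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical regex-based sketch; not the library's implementation. */
final class StripTagsSketch {
    private static final Pattern COMMENTS =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Whole applet/style/script elements, including the text between the pair.
    private static final Pattern BLOCKS = Pattern.compile(
            "<(applet|style|script)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String s = COMMENTS.matcher(html).replaceAll(" ");
        s = BLOCKS.matcher(s).replaceAll(" ");
        s = TAGS.matcher(s).replaceAll(" ");
        // Collapse runs of whitespace to one space, then trim.
        // Entities such as &amp; are deliberately left untouched.
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p> &amp; more"));
    }
}
```

Comments and script blocks are removed before ordinary tags so that any < or > inside them cannot confuse the simple tag pattern.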

stripXMLTags

public static java.lang.String stripXMLTags(java.lang.String xml)
Removes tags from XML leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed XML, no > in comments, all <...> balanced.

Parameters:
xml - input XML
Returns:
raw text, with runs of whitespace collapsed to a single space, trimmed.

possBareHTMLEntityWithSemicolonToChar

protected static char possBareHTMLEntityWithSemicolonToChar(java.lang.String possBareEntityWithSemicolon,
                                                            char translateNbspTo)
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts Entity to corresponding char.

Parameters:
possBareEntityWithSemicolon - string that may hold an entity. Lead & must be stripped, but may optionally contain text past the ;
translateNbspTo - char you would like nbsp translated to, usually ' ' or (char) 160 .
Returns:
corresponding unicode character, or 0 if the entity is invalid.