Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

Friday, May 4, 2012

I have various HTML strings that I need to cut to 100 characters (counting the stripped content, not the original markup) without stripping the tags and without breaking the HTML.

Original HTML string (288 characters):

$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

Standard trim: trimming to 100 characters breaks the HTML, and the stripped content comes to only ~40 characters:

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */

Stripped HTML: outputs the correct character count but obviously loses the formatting:

$content = substr(strip_tags($content), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

Partial solution: using HTML Tidy or HTML Purifier to close off the tags outputs clean HTML, but it is 100 characters of HTML, not 100 characters of displayed content.

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */

Challenge: output clean HTML containing n characters of displayed text (the characters of the HTML elements themselves do not count):

$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>... */

6 comments:

User, May 4, 2012 at 10:29 AM
Not amazing, but works.

function html_cut($text, $max_length)
{
    $tags = array();            // stack of tag names that are still open
    $result = "";

    $is_open = false;           // inside an opening tag
    $grab_open = false;         // still reading the opening tag's name
    $is_close = false;          // inside a closing tag
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;              // visible characters copied so far

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol = $text[$i];    // $text{$i} is a syntax error as of PHP 8
        $result .= $symbol;

        switch ($symbol)
        {
            case '<':
                $is_open = true;
                $grab_open = true;
                break;

            case '"':
                // quotes only delimit attribute values inside a tag;
                // in plain text they are visible characters
                if ($is_open)
                    $in_double_quotes = !$in_double_quotes;
                else if (!$is_close)
                    $stripped++;

                break;

            case "'":
                if ($is_open)
                    $in_single_quotes = !$in_single_quotes;
                else if (!$is_close)
                    $stripped++;

                break;

            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    $is_close = true;
                    $is_open = false;
                    $grab_open = false;
                }
                else if (!$is_open && !$is_close)
                    $stripped++;    // a slash in plain text is visible

                break;

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

            case '>':
                if ($is_open)
                {
                    $is_open = false;
                    $grab_open = false;
                    array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    // close whatever is still open, innermost first
    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

Usage example:

$content = html_cut($content, 100);

User, May 4, 2012 at 10:29 AM
I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:

function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if (is_array($ending)) {
        extract($ending);
    }
    if ($considerHtml) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen($ending);
        $openTags = array();
        $truncate = '';
        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            // count each HTML entity as a single character
            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0, $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }

    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - mb_strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        // mb_strrpos() returns false when there is no space; isset() would not catch that
        if ($spacepos !== false) {
            if ($considerHtml) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }

    $truncate .= $ending;

    if ($considerHtml) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

User, May 4, 2012 at 10:29 AM
Use an HTML parser and stop after 100 characters of text.
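
One way this suggestion could look, as a rough sketch rather than a definitive implementation: parse the fragment with DOMDocument, walk the text nodes while counting visible characters, trim the text node where the limit is reached, and remove everything after it. The function names dom_cut() and dom_cut_walk() are made up for this illustration, and the sketch assumes single-byte (ASCII) text; for UTF-8 input the strlen()/substr() calls would need their mb_ counterparts and the parser would need an encoding hint.

```php
<?php
// Hypothetical helper, not from the post: keep the first $limit visible
// characters of an HTML fragment and let DOMDocument close the open tags.
// Assumes single-byte (ASCII) text.
function dom_cut($html, $limit, $ending = '...')
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>');   // wrapper gives one known root
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->getElementsByTagName('div')->item(0); // the wrapper
    if (strlen($root->textContent) <= $limit) {
        return $html;                             // already short enough
    }

    $remaining = $limit;
    dom_cut_walk($root, $remaining);

    // Serialize the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out . $ending;
}

// Walks $node depth-first, trimming text once $remaining runs out.
function dom_cut_walk(DOMNode $node, &$remaining)
{
    // Copy the child list first: removing nodes while iterating the live
    // DOMNodeList would skip entries.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);           // everything after the cut goes
        } elseif ($child instanceof DOMText) {
            $length = strlen($child->data);
            if ($length > $remaining) {
                $child->data = substr($child->data, 0, $remaining);
            }
            $remaining -= $length;                // zero or negative once cut
        } elseif ($child->hasChildNodes()) {
            dom_cut_walk($child, $remaining);
        }
    }
}
```

Usage would be $content = dom_cut($content, 100);. Because the markup is re-serialized by the parser, the result is not guaranteed to match the input byte for byte (attribute values come back in double quotes, for example), and the parser's handling of whitespace between tags can make the count differ slightly from strip_tags().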

User, May 4, 2012 at 10:29 AM
Use PHP's DOMDocument class to normalize an HTML fragment:

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

This question is similar to an earlier question, and I've copied and pasted one solution here. If the HTML is submitted by users, you'll also need to filter out potential JavaScript attack vectors like onmouseover="do_something_evil()" or <a href="javascript:more_evil();">...</a>. Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.

User, May 4, 2012 at 10:29 AM
You should use Tidy HTML. You cut the string and then you run Tidy to close the tags.

(Credits where credits are due)

User, May 4, 2012 at 10:29 AM
Regardless of the 100-count issues you state at the beginning, you indicate the following in the challenge:

output the character count of strip_tags (the number of characters in the actual displayed text of the HTML)
retain HTML formatting
close any unfinished HTML tag

Here is my proposal:
Basically, I parse through each character, counting as I go. I make sure NOT to count any characters inside an HTML tag. I also check at the end to make sure I am not in the middle of a word when I stop. Once I stop, I backtrack to the first available SPACE or > as a stopping point.

$position = 0;
$length = strlen($content);
$data = array();

// process the content, putting each 100-character section into an array
while ($position < $length)
{
    $next_position = get_position($content, $position, 100);
    // substr() takes a length, not an end offset
    $data[] = substr($content, $position, $next_position - $position);
    $position = $next_position;
}

// show the array
print_r($data);

function get_position($content, $position, $chars = 100)
{
    $start = $position;
    $length = strlen($content);
    $count = 0;

    // count to 100 characters, skipping over all of the HTML
    while ($count < $chars && $position < $length) {
        if ($content[$position] == '<') {
            // jump past the closing '>' of this tag
            $end = strpos($content, '>', $position);
            if ($end === false) {
                return $length;     // unterminated tag: take the rest
            }
            $position = $end + 1;
            continue;
        }
        $count++;
        $position++;
    }

    // the rest of the content fits, so there is no word to split
    if ($position >= $length) {
        return $length;
    }

    // find out where there is a logical break before 100 characters
    $data = substr($content, 0, $position);

    $space = strrpos($data, " ");
    $tag = strrpos($data, ">");

    // break at the last space, or just after the last tag
    $break = max($space === false ? 0 : $space, $tag === false ? 0 : $tag + 1);

    // never return a position at or before the start, or the caller loops forever
    return $break > $start ? $break : $position;
}

This also counts line breaks and similar characters. Since they take up space, I have not removed them.

Ccna final exam - java, php, javascript, ios, cshap all in one
