Thursday, April 12, 2012

PHP: Truncate HTML, ignoring tags


I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags count towards the length and less visible text is returned. Truncating can also leave tags unclosed or cut in half (so Tidy may not work properly, and there is still less content). How can I truncate based on the visible text only (probably stopping at a table, as that could cause more complex issues)?




substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."



Would result in:




Hello, my <strong>name</st...



What I would want is:




Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...



How can I do this?



While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).



Also note that I have included an HTML entity &acute; - which would have to be considered as a single character (rather than 7 characters as in this example).



strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.
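
For reference, the strip_tags fallback can at least count entities correctly if they are decoded before cutting. A minimal sketch, assuming the mbstring extension and UTF-8 text; `truncatePlain` and its `$suffix` parameter are hypothetical names:

```php
<?php
// Hypothetical helper: plain-text fallback that drops all markup.
function truncatePlain($html, $limit, $suffix = '...')
{
    // Strip tags first, then decode, so an escaped "&lt;b&gt;" is never treated as a tag.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }

    // Cut by characters (not bytes) and re-escape the result for output.
    return htmlspecialchars(mb_substr($text, 0, $limit, 'UTF-8'), ENT_QUOTES, 'UTF-8') . $suffix;
}
```

With the example string and a limit of 26 this cuts right after `I´m`, but `<strong>` and `<em>` are lost, which is exactly the drawback described above.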


Source: Tips4all

10 comments:

  1. Assuming you are using XHTML, it's not too hard to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

    <?php
    header('Content-type: text/plain');

    function printTruncated($maxLength, $html)
    {
        $printedLength = 0;
        $position = 0;
        $tags = array();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
        {
            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = substr($html, $position, $tagPosition - $position);
            if ($printedLength + strlen($str) > $maxLength)
            {
                print(substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += strlen($str);

            if ($tag[0] == '&')
            {
                // Handle the entity: it counts as a single character.
                print($tag);
                $printedLength++;
            }
            else
            {
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/')
                {
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    print($tag);
                }
                else if ($tag[strlen($tag) - 2] == '/')
                {
                    // Self-closing tag.
                    print($tag);
                }
                else
                {
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(substr($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
            printf('</%s>', array_pop($tags));
    }


    printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

    printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

    printTruncated(10, '<em><b>&lt;Hello&gt;</b>&#20;world!</em>'); print("\n");


    Edit: Updated to handle entities as well.

  2. 100% accurate, but pretty difficult approach:


    1. Iterate over the characters using the DOM (walk the text nodes, counting as you go)
    2. Use DOM methods to remove the remaining elements
    3. Serialize the DOM
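
    The three DOM steps above might look roughly like this. It is only a sketch, assuming the dom and mbstring extensions and UTF-8 input; `truncateNode` and `truncateHtmlDom` are hypothetical names, and the wrapper `<div>` is just a convenient known root:

    ```php
    <?php
    // Hypothetical helper: walk $node in document order, spending $remaining characters.
    function truncateNode(DOMNode $node, &$remaining)
    {
        // Snapshot the live child list so removals do not disturb the iteration.
        foreach (iterator_to_array($node->childNodes) as $child) {
            if ($remaining <= 0) {
                $node->removeChild($child);        // past the limit: drop the node
            } elseif ($child instanceof DOMText) {
                $length = mb_strlen($child->data, 'UTF-8');
                if ($length > $remaining) {
                    $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
                }
                $remaining -= min($length, $remaining);
            } else {
                truncateNode($child, $remaining);  // element: recurse into it
            }
        }
    }

    // Hypothetical helper: truncate $html to $limit visible characters.
    function truncateHtmlDom($html, $limit)
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
        // The XML encoding hint is a common trick to make libxml read the input as UTF-8.
        $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $root = $doc->getElementsByTagName('div')->item(0);
        $remaining = $limit;
        truncateNode($root, $remaining);

        // Serialize only the children of the wrapper, not the wrapper itself.
        $out = '';
        foreach ($root->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
        return $out;
    }
    ```

    Because the parser builds the tree, entities are already single characters by the time they are counted, and the serializer closes every element that survives. How non-ASCII characters are written back out (raw or as entities) depends on the libxml serializer.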


    Easy brute-force approach:


    1. Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE, where <tag> stands for a pattern that matches a tag.
    2. Measure the text length you want (the text is every second element of the split; html_entity_decode() can help you measure accurately).
    3. Cut the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity).
    4. Fix it with HTML Tidy.
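
    A rough sketch of the split, measure and cut steps, assuming UTF-8 input. `cutHtmlByText` is a hypothetical name, and instead of trimming a chopped entity afterwards this variation treats each entity as one unit while counting, so it never cuts inside one. Closing the tags that are left open is still left to HTML Tidy:

    ```php
    <?php
    // Hypothetical helper: keep $limit visible characters, copying tags through untouched.
    function cutHtmlByText($html, $limit)
    {
        // Even indexes hold text fragments, odd indexes hold the captured tags.
        $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

        $out = '';
        $remaining = $limit;
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $out .= $part;              // a tag costs nothing
                continue;
            }

            // Break the text into units: a whole entity, or a single character.
            preg_match_all('/&#?\w+;|./su', $part, $m);
            $units = $m[0];

            if (count($units) > $remaining) {
                // The limit falls inside this fragment: keep what fits and stop.
                return $out . implode('', array_slice($units, 0, $remaining));
            }

            $out .= $part;
            $remaining -= count($units);
        }

        return $out;                        // the text was shorter than the limit
    }

    echo cutHtmlByText('Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.', 26), "\n";
    ```

    For the string from the question this keeps everything up to and including `I&acute;m`. When the cut falls inside an element, the result ends with that element still open, which is the part Tidy has to repair.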

  3. The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in the href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.

    function substr_html($string, $length)
    {
        $count = 0;
        /*
         * $state = 0 - normal text
         * $state = 1 - in HTML tag
         * $state = 2 - in HTML entity
         */
        $state = 0;
        $total = strlen($string);
        for ($i = 0; $i < $total; $i++) {
            $char = $string[$i];
            if ($char == '<') {
                $state = 1;
            } else if ($char == '&') {
                $state = 2;
            } else if ($char == ';') {
                // Count the entity once it is complete, so the cut never
                // lands in the middle of it.
                $state = 0;
                $count++;
            } else if ($char == '>') {
                $state = 0;
            } else if ($state === 0) {
                $count++;
            }

            if ($count === $length) {
                return substr($string, 0, $i + 1);
            }
        }
        return $string;
    }

  4. You could use DOMDocument in this case, with a nasty regex hack. The worst that would happen is a warning if there's a broken tag:

    $dom = new DOMDocument();
    $dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26));
    $html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
    echo $html;


    Should give output: Hello, my <strong>name</strong>.

  5. This is very difficult to do without using a validator and a parser. Imagine you have:

    <div id='x'>
    <div id='y'>
    <h1>Heading</h1>
    500
    lines
    of
    html
    ...
    etc
    ...
    </div>
    </div>


    How do you plan to truncate that and end up with valid HTML?

    After a brief search, I found this link which could help.

  6. I've made light changes to Søren Løvborg's printTruncated function, making it UTF-8 compatible:

    /* Truncate HTML, close opened tags
     *
     * @param int, maxlength of the string
     * @param string, html
     * @return $html
     */
    function html_truncate($maxLength, $html){

        mb_internal_encoding("UTF-8");

        $printedLength = 0; // counted in characters
        $position = 0;      // byte offset, as used by preg_match()
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){

            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag (offsets are bytes, so use substr).
            $str = substr($html, $position, $tagPosition - $position);

            if ($printedLength + mb_strlen($str) > $maxLength){
                // Cut by characters, not bytes.
                print(mb_substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    print($tag);
                }
                else if ($tag[strlen($tag) - 2] == '/'){
                    // Self-closing tag ($tag is indexed by byte, so use strlen).
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag (byte offset again).
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(mb_substr(substr($html, $position), 0, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
            printf('</%s>', array_pop($tags));

        // Return the buffered output instead of sending it to the browser.
        return ob_get_clean();
    }

  7. Bounce added multi-byte character support to Søren Løvborg's solution - I've added:


    - support for unpaired HTML tags (e.g. <hr>, <br>, <col>), which don't get closed; in HTML a '/' is not required at the end of these (it is for XHTML though),
    - a customisable truncation indicator (defaults to &hellip; i.e. … ),
    - returning a string without using the output buffer, and
    - unit tests with 100% coverage.


    All this at Pastie.
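
    A minimal sketch of the first three points (not the Pastie code itself), assuming UTF-8 input. `truncateHtmlVoid`, its `$indicator` parameter and the list of void elements are assumptions made for this illustration:

    ```php
    <?php
    // Hypothetical helper: tag-stack truncation that knows about unpaired (void) tags.
    function truncateHtmlVoid($html, $limit, $indicator = '&hellip;')
    {
        $void = array('area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                      'input', 'link', 'meta', 'param', 'source', 'track', 'wbr');

        // Tokenize into tags, entities and single characters.
        preg_match_all('{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;|.}su', $html, $tokens, PREG_SET_ORDER);

        $out = '';
        $open = array();
        $count = 0;
        $truncated = false;

        foreach ($tokens as $t) {
            $token = $t[0];
            $name = isset($t[1]) ? strtolower($t[1]) : '';

            if ($name !== '') {
                // A tag is copied through and never counted.
                if ($token[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);       // closes the innermost open tag
                    }
                } elseif (!in_array($name, $void, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;            // a real opening tag
                }
                $out .= $token;
                continue;
            }

            // An entity or a character: each counts as one.
            if ($count >= $limit) {
                $truncated = true;
                break;
            }
            $out .= $token;
            $count++;
        }

        if ($truncated) {
            $out .= $indicator;
        }
        while (!empty($open)) {
            $out .= '</' . array_pop($open) . '>';
        }
        return $out;
    }
    ```

    Comments, CDATA sections and mismatched tags are not handled here; they would be tokenized as plain text or copied through as-is.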

  8. I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP

  9. Another light change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (needs mbstring) and making it return a string instead of printing one. I think that's more useful.
    My code doesn't use output buffering like Bounce's variant, just one more variable.

    UPD: to make it work properly with UTF-8 characters in tag attributes you need the mb_preg_match function, listed below.

    Many thanks to Søren Løvborg for that function, it's very good.

    /* Truncate HTML, close opened tags
     *
     * @param int, maxlength of the string
     * @param string, html
     * @return $html
     */

    function htmlTruncate($maxLength, $html)
    {
        mb_internal_encoding("UTF-8");
        $printedLength = 0;
        $position = 0; // character offset (mb_preg_match converts from bytes)
        $tags = array();
        $out = "";

        while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
        {
            list($tag, $tagPosition) = $match[0];

            // Append text leading up to the tag.
            $str = mb_substr($html, $position, $tagPosition - $position);
            if ($printedLength + mb_strlen($str) > $maxLength)
            {
                $out .= mb_substr($str, 0, $maxLength - $printedLength);
                $printedLength = $maxLength;
                break;
            }

            $out .= $str;
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&')
            {
                // Handle the entity.
                $out .= $tag;
                $printedLength++;
            }
            else
            {
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/')
                {
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    $out .= $tag;
                }
                else if (mb_substr($tag, -2, 1) == '/')
                {
                    // Self-closing tag (checked by character, since attributes may be multi-byte).
                    $out .= $tag;
                }
                else
                {
                    // Opening tag.
                    $out .= $tag;
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + mb_strlen($tag);
        }

        // Append any remaining text.
        if ($printedLength < $maxLength && $position < mb_strlen($html))
            $out .= mb_substr($html, $position, $maxLength - $printedLength);

        // Close any open tags.
        while (!empty($tags))
            $out .= sprintf('</%s>', array_pop($tags));

        return $out;
    }

    function mb_preg_match(
        $ps_pattern,
        $ps_subject,
        &$pa_matches,
        $pn_flags = 0,
        $pn_offset = 0,
        $ps_encoding = NULL
    ) {
        // WARNING! - All this function does is to correct offsets, nothing else:
        // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)

        if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();

        // Convert the character offset to the byte offset preg_match() expects.
        $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
        $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

        if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
            // Convert the returned byte offsets back to character offsets.
            foreach ($pa_matches as &$ha_match) {
                $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
            }
            unset($ha_match); // break the reference left by foreach
        }

        return $ret;
    }

  10. I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.

    /**
     * function to truncate and then clean up end of the HTML,
     * truncates by counting characters outside of HTML tags
     *
     * @author alex lockwood, alex dot lockwood at websightdesign
     *
     * @param string $str the string to truncate
     * @param int $len the number of characters
     * @param string $end the end string for truncation
     * @return string $truncated_html
     */
    public static function truncateHTML($str, $len, $end = '&hellip;'){
        // find all tags
        $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i'; // match html tags and entities
        preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
        $i = 0;
        $openTags = array();
        $warnings = array();
        // loop through each found tag that is within the $len, add those characters to the len,
        // also track open and closed tags
        // $matches[$i][0] = the whole tag string -- the only applicable field for html entities
        // IF it's not matching an &htmlentity; the following apply
        // $matches[$i][1] = the start of the tag, either '<' or '</'
        // $matches[$i][2] = the tag name
        // $matches[$i][3] = the end of the tag
        // $matches[$i][$j][0] = the string
        // $matches[$i][$j][1] = the string offset

        // check that the match exists before reading its offset
        while (isset($matches[$i]) && $matches[$i][0][1] < $len) {

            $len = $len + strlen($matches[$i][0][0]);
            if (substr($matches[$i][0][0], 0, 1) == '&')
                $len = $len - 1;

            // if $matches[$i][2] is undefined then it's an html entity, want to ignore those for tag counting
            // ignore empty/singleton tags for tag counting
            if (!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0], array('br', 'img', 'hr', 'input', 'param', 'link'))) {
                // double check
                if (substr($matches[$i][3][0], -1) != '/' && substr($matches[$i][1][0], -1) != '/')
                    $openTags[] = $matches[$i][2][0];
                elseif (end($openTags) == $matches[$i][2][0]) {
                    array_pop($openTags);
                } else {
                    $warnings[] = "html has some tags mismatched in it: $str";
                }
            }

            $i++;
        }

        // build the closing tags for everything still open, innermost first
        $closeTagString = '';

        if (!empty($openTags)) {
            $openTags = array_reverse($openTags);
            foreach ($openTags as $t) {
                $closeTagString .= "</" . $t . ">";
            }
        }

        if (strlen($str) > $len) {
            // truncate with new len
            $truncated_html = substr($str, 0, $len);
            // add the end text
            $truncated_html .= $end;
            // restore any open tags
            $truncated_html .= $closeTagString;
        } else
            $truncated_html = $str;

        return $truncated_html;
    }
