Pulling data from text with nested parentheses in PHP

[DRAFT]

A more accurate title for this post might be:
Extracting data from a list of items whose sub-items are in parentheses, without splitting nested parentheses (which hold extra information about a sub-item) into sub-items of their own.

I am working on importing aircraft models into a database.

The data come in a format like this string:

$str = "Aerospace Technologies (N22B, N24A) Agusta (A109AII) Bell
 (47, 47B, 47B3, 47D, (USAF H-13B), 47D1 (USAF H-13D, H-13E, Navy
 HTL-4, -5), 47E (Navy HTL-3), 47G (USAF H-13G), 47G-2A, -2A-1,
 47-G2 (USAF H-13H), -3 (OH-13H), -3B, -3B-1 (TH-13T), -3B-2,
 -3B-2A, -4, -4A, -5, -5A, 47H-1, 47J, -2, -2A, 47K)";

Note that the Bell section has nested parentheses:

 Bell (47, 47B, 47B3, 47D, (USAF H-13B), 47D1 (USAF H-13D, H-13E, Navy
 HTL-4, -5), 47E (Navy HTL-3), 47G (USAF H-13G), 47G-2A, -2A-1, 47-G2
 (USAF H-13H), -3 (OH-13H), -3B, -3B-1 (TH-13T), -3B-2, -3B-2A, -4,
 -4A, -5, -5A, 47H-1, 47J, -2, -2A, 47K)

The problem with simple regular expressions is that they have no easy way to keep track of nested parentheses. PCRE (the engine behind PHP's preg functions) does support recursive patterns, but I don't know much about them.
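For reference, here is a minimal sketch of what a recursive pattern can look like. It is not the approach used in the rest of this post, and the sample string is a shortened, made-up version of the data above. The (?R) token tells PCRE to apply the whole pattern again at that point, which lets a single match cover a balanced group including anything nested inside it.

```php
<?php
// Hypothetical sample input, shortened from the data above.
$str = "Agusta (A109AII) Bell (47D1 (USAF H-13D, H-13E), 47E (Navy HTL-3), 47K)";

// \(            an opening parenthesis
// (?: ... )*    followed by any number of:
//   [^()]++       runs of non-parenthesis characters (possessive), or
//   (?R)          the whole pattern again, i.e. a nested balanced group
// \)            and the matching closing parenthesis
$regex = '/\((?:[^()]++|(?R))*\)/';

preg_match_all($regex, $str, $matches);

// $matches[0] holds each top-level group, nested parentheses included.
print_r($matches[0]);
```

This only finds the balanced groups. Splitting the models inside each group without breaking on the nested commas still needs a further step.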

After a lot of time staring at the screen, I decided to convert nested parentheses into square brackets, and then convert any commas inside those brackets to semicolons.

For example:
(USAF H-13D, H-13E, Navy HTL-4, -5)
from above should become
[USAF H-13D; H-13E; Navy HTL-4; -5]

I want to keep the other commas intact so I can use explode(",", …) to split the models into an array.

This function will do the first part:

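function bracketize_nested_parentheses( $str )
{
    // Replaces nested parentheses with brackets.
    // \1 = the outer "(" plus the text before the inner group,
    // \3 = the contents of the inner group; the inner "(" and ")" are dropped.
    $regex = '/(\([^()]+)(\()([^()]+)(\))/';

    $m_fix = $str;
    // Each pass converts at most one inner group per outer group,
    // so repeat until a pass makes no replacements.
    do {
        $m_fix = preg_replace( $regex, '\\1[\\3]', $m_fix, -1, $count );
    } while ( $count > 0 );

    return $m_fix;
}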

I did this because there are no brackets in the source data so I can safely convert them back to parentheses later.

Then with this function I replaced any commas that appear in brackets with semicolons (which also do not appear in the source data so I can convert them back to commas later).

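function replace_commas_in_brackets( $str )
{
    // Replaces commas in brackets with semicolons.
    // \1 = "[" plus the text up to the first remaining comma in that bracket.
    $regex = '/(\[[^\],]+),/';

    $m_fix = $str;
    // Each pass replaces at most one comma per bracketed group,
    // so repeat until a pass makes no replacements.
    do {
        $m_fix = preg_replace( $regex, '\\1;', $m_fix, -1, $count );
    } while ( $count > 0 );

    return $m_fix;
}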
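
One possible alternative to the two regex loops is a single pass over the string with a depth counter. The sketch below is not from the original workflow: the function name mark_nested_groups is made up for illustration, and it assumes the parentheses in the data are balanced. It treats anything opened inside another group (depth greater than 1) as nested.

```php
<?php
// Hypothetical helper: one pass over the string, no regular expressions.
function mark_nested_groups(string $str): string
{
    $depth = 0; // how many "(" are currently open
    $out = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $ch = $str[$i];
        if ($ch === '(') {
            $depth++;
            // Only parentheses opened inside another group become brackets.
            $out .= ($depth > 1) ? '[' : '(';
        } elseif ($ch === ')') {
            $out .= ($depth > 1) ? ']' : ')';
            $depth = max(0, $depth - 1); // tolerate a stray ")"
        } elseif ($ch === ',' && $depth > 1) {
            // Commas inside a nested group become semicolons.
            $out .= ';';
        } else {
            $out .= $ch;
        }
    }
    return $out;
}

$sample = "Bell (47D, (USAF H-13B), 47D1 (USAF H-13D, H-13E, Navy HTL-4, -5), 47E)";
echo mark_nested_groups($sample), "\n";
// Bell (47D, [USAF H-13B], 47D1 [USAF H-13D; H-13E; Navy HTL-4; -5], 47E)
```

Because it looks at each character once, the cost grows with the length of the string rather than with the number of nested groups or commas.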

 

If you know of another method (especially a more efficient one), please do leave a comment.

In case you are interested: from here I used explode(")", …) and explode("(", …) to split up the data further.
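
Roughly, that last step could look like the sketch below. The input string is a shortened, made-up example of the data after both conversions, the variable names are only for illustration, and the short list syntax assumes PHP 7.1 or later. It also turns the brackets and semicolons back into parentheses and commas once the models have been split.

```php
<?php
// Hypothetical input: sample data after both conversion steps.
$converted = "Aerospace Technologies (N22B, N24A) Agusta (A109AII) "
           . "Bell (47D1 [USAF H-13D; H-13E], 47E [Navy HTL-3], 47K)";

$models_by_maker = [];

// Every top-level group ends with ")", so this gives one chunk per maker.
foreach (explode(')', $converted) as $chunk) {
    if (strpos($chunk, '(') === false) {
        continue; // e.g. the empty piece after the final ")"
    }
    // Split "Maker (model, model, ..." at the first "(" only.
    [$maker, $list] = explode('(', $chunk, 2);

    $models = [];
    foreach (explode(',', $list) as $model) {
        // Restore the placeholders to the original punctuation.
        $models[] = strtr(trim($model), ['[' => '(', ']' => ')', ';' => ',']);
    }
    $models_by_maker[trim($maker)] = $models;
}

print_r($models_by_maker);
```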
