03_html

HTML modification

prerequisites

hyphenate a beautiful soup



hyphenate_soup

 hyphenate_soup (soup:bs4.BeautifulSoup,
                 hyphenator:collections.abc.Callable[[str],str],
                 exclude_classes:tuple[typing.Type[bs4.element.PageElement],...]=(
                     <class 'bs4.element.PreformattedString'>,
                     <class 'bs4.element.Stylesheet'>,
                     <class 'bs4.element.Script'>,
                     <class 'bs4.element.RubyTextString'>,
                     <class 'bs4.element.RubyParenthesisString'>))

Call hyphenator on each word that appears in a suitable element of soup, and replace the contents of that element with the result. An element is suitable if it contains text and its class is neither listed in exclude_classes nor a subclass of a class listed there. The soup is modified in place and nothing is returned.

| Parameter | Type | Default | Details |
|---|---|---|---|
| soup | BeautifulSoup | | soup to be modified |
| hyphenator | Callable | | hyphenator |
| exclude_classes | tuple | (<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>) | do not modify inside these |
| Returns | None | | |
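
The behaviour described above can be sketched roughly as follows. This is not the library's implementation: `toy_hyphenator` and `hyphenate_strings` are hypothetical names made up for this illustration, the word pattern `\w+` is an assumption, and the exclusion tuple is shortened to three of the five default classes.

```python
import re

import bs4
from bs4.element import PreformattedString, Script, Stylesheet

SOFT_HYPHEN = "\u00ad"


def toy_hyphenator(word: str) -> str:
    """Hypothetical stand-in: break long words after every 4th character."""
    if len(word) < 8:
        return word
    return SOFT_HYPHEN.join(word[i:i + 4] for i in range(0, len(word), 4))


def hyphenate_strings(soup, hyphenator,
                      exclude_classes=(PreformattedString, Stylesheet, Script)):
    """Sketch only: apply hyphenator to every word in non-excluded strings."""
    # Copy the result into a list first, because nodes are replaced
    # while we walk over them.
    for node in list(soup.find_all(string=True)):
        # isinstance also catches subclasses of the excluded classes.
        if isinstance(node, exclude_classes):
            continue
        old_text = str(node)
        new_text = re.sub(r"\w+", lambda m: hyphenator(m.group()), old_text)
        if new_text != old_text:
            node.replace_with(bs4.NavigableString(new_text))


soup = bs4.BeautifulSoup(
    "<p>syllabification</p><script>var syllabification = 1;</script>",
    "html.parser",
)
hyphenate_strings(soup, toy_hyphenator)
```

After the call, the paragraph text contains soft hyphens, while the script body is left untouched because its string is an instance of `bs4.element.Script`.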

soup analysis

The following function consumes many beautiful soups and counts the words in them. Ways to use it include:
- list all the words longer than a threshold and see if they are compounds
- list all the frequent words and see if they are hyphenated right
- list all the words of low frequency and see if they are typos
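
The three uses listed above could look like this on a `Counter`. The counts here are invented for illustration and merely stand in for a real result; the thresholds (13 characters, top 2, count of 1) are arbitrary assumptions.

```python
from collections import Counter

# Hypothetical counts standing in for the result of a real analysis.
counts = Counter({
    "the": 339,
    "hyphen": 131,
    "syllabification": 19,
    "hyphenaton": 1,        # a deliberate typo
    "microtypography": 1,
})

# Words longer than a threshold: candidates for compound checks.
long_words = sorted(w for w in counts if len(w) > 13)

# The most frequent words: worth checking for correct hyphenation.
frequent = [w for w, n in counts.most_common(2)]

# Words seen only once: possible typos.
rare = sorted(w for w, n in counts.items() if n == 1)
```

Note that rare words are only typo candidates: here `microtypography` is a legitimate word that also appears once.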



analyze_soups

 analyze_soups (dinner:list[bs4.BeautifulSoup],
                exclude_classes:tuple[typing.Type[bs4.element.PageElement],...]=(
                    <class 'bs4.element.PreformattedString'>,
                    <class 'bs4.element.Stylesheet'>,
                    <class 'bs4.element.Script'>,
                    <class 'bs4.element.RubyTextString'>,
                    <class 'bs4.element.RubyParenthesisString'>))

Count words appearing in all soups.

| Parameter | Type | Default | Details |
|---|---|---|---|
| dinner | list | | soups to be read |
| exclude_classes | tuple | (<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>) | do not look inside these |
| Returns | Counter | | |
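
As a rough idea of the counting step, the sketch below tallies words over plain text strings. It is an assumption-laden illustration, not the actual code: `count_words` is a hypothetical helper, and both the lowercasing and the `\w+` word pattern are guesses at how words might be delimited.

```python
import re
from collections import Counter


def count_words(texts):
    """Hypothetical helper: count lowercase words across several strings."""
    counts = Counter()
    for text in texts:
        # \w+ matches runs of letters, digits and underscores,
        # including non-ASCII letters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


word_counts = count_words(["The hyphen", "the Hyphen joins words"])
```

Because the result is a `Counter`, methods such as `most_common` are available on it directly.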

It can be interesting to look at the longest words on some pages, perhaps to check if hyphenation exceptions are needed for them:

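import bs4
import requests

# Download two pages and parse each one into a soup.
soups = [
    bs4.BeautifulSoup(r.content, 'lxml')
    for r in map(requests.get, [
        'https://en.wikipedia.org/wiki/Syllabification',
        'https://en.wikipedia.org/wiki/Hyphen'
    ])
]
# Count the words across both pages.
c = analyze_soups(soups)
# List the words longer than 13 characters with their counts.
[(w, n) for w, n in c.items() if len(w) > 13]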
[('syllabification', 19),
 ('conventionally', 2),
 ('correspondence', 1),
 ('implementation', 1),
 ('heterosyllabic', 1),
 ('disambiguation', 2),
 ('prescriptivist', 1),
 ('recommendations', 2),
 ('representatives', 1),
 ('misinterpretable', 1),
 ('misunderstanding', 1),
 ('misinterpreted', 2),
 ('past_participled', 1),
 ('differentiated', 1),
 ('recommendation', 1),
 ('sociolinguistics', 2),
 ('reduplicatives', 1),
 ('implementations', 1),
 ('indistinguishable', 1),
 ('standardization', 1),
 ('reinterpretations', 1),
 ('interpretation', 1),
 ('capitalization', 1),
 ('classifications', 1),
 ('microtypography', 1),
 ('phototypesetting', 1),
 ('classification', 1),
 ('multiplication', 1),
 ('srpskohrvatski', 1),
 ('српскохрватски', 1)]

Or the most common words:

c.most_common(20)
[('the', 339),
 ('a', 215),
 ('of', 190),
 ('in', 174),
 ('and', 159),
 ('is', 135),
 ('to', 132),
 ('hyphen', 131),
 ('for', 84),
 ('as', 79),
 ('or', 75),
 ('be', 65),
 ('are', 61),
 ('this', 57),
 ('that', 56),
 ('with', 55),
 ('used', 53),
 ('hyphenation', 46),
 ('hyphens', 45),
 ('word', 41)]