
Word Counter

26 Apr, 2020

3 min read

This is a fun script that I am trying out to learn Rust. It also benefits me directly: it automates getting an approximate word count for my MDX notes, which will later help me compute the read time per note.

The Word Counter code was written in about 2 hours and is not optimal. It uses brute force, removing text matched by different regular expressions one by one, in sequence. Someday I will improve this script to use an AST for better performance and more accurate results.


Code Breakdown

First of all, I am using the regex crate, which is maintained under the official rust-lang organization.

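Using regex is simple. Its syntax is similar to RE2's and will feel familiar if you know JavaScript RegExp, but it is not a superset of it: look-around and backreferences are not supported.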

Reading data from stdin is required, as we will later use our build as part of a pipe:

use std::io::{self, Read};
...
let mut buffer = String::new();
io::stdin().read_to_string(&mut buffer)?;
...
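
The `?` in the snippet above only compiles inside a function that returns a `Result`. As a minimal sketch of that, here is the same read wrapped in a small helper. The name `read_all` and the generic reader parameter are my assumptions for illustration, not part of the original script:

```rust
use std::io::{self, Read};

/// Reads everything from `reader` into a String.
/// `read_all` is a hypothetical helper name, not taken from the original script.
fn read_all<R: Read>(mut reader: R) -> io::Result<String> {
    let mut buffer = String::new();
    // `?` hands any I/O error (including invalid UTF-8 input) back to the caller.
    reader.read_to_string(&mut buffer)?;
    Ok(buffer)
}
```

In the real program this would be called as `read_all(io::stdin())` from a `main` that returns `io::Result<()>`. Taking any `Read` instead of stdin directly also makes it easy to feed the function a byte slice while trying things out.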

Now we need a few regular expressions to be pre-compiled, so that we can use them later:

use regex::{Regex, RegexBuilder, Captures};
...
let re1 = Regex::new(r"<\s*[^>]*>(.*?)<\s*/.*\s*>").unwrap();
let re2 = Regex::new(r"<\s*[^>]*/?>|<\s*/.*\s*>").unwrap();
let re3 = RegexBuilder::new(r"^(`{3}\w)[^`]*(`{3})$")
    .multi_line(true)
    .dot_matches_new_line(true)
    .build()
    .unwrap();
...

Line 3 (re1) creates a regex for all HTML container tags, capturing the inner text so that it can be word-counted later.

Line 4 (re2) creates a regex for all HTML empty tags and any opening/closing tags not caught earlier.

Line 5 (re3) creates a multi-line flagged regex that matches all the code blocks.

Note: Code Diff blocks are not handled yet. Currently, the content inside a Code Diff block is included in the word count.

All of the above regexes will later be used to replace their matches with an empty string, so as to remove them. 🎉 ✌️

Some more regexes:
let return_to_space = Regex::new(r"\n").unwrap();
let space_re = Regex::new(r"\s").unwrap();

The return_to_space regex will be used to remove all the remaining line breaks from the output string that we get after applying the previous regexes.

The space_re regex will later be used to filter out extra spaces, so as to improve word count accuracy.

Final Regex to extract Words
let word_counter_re = RegexBuilder::new(r"[^\s#\*]*")
  .multi_line(true)
  .dot_matches_new_line(true)
  .build()
  .unwrap();

Note that the regex above does not match pure words: our MDX files contain inline code snippets and other content that breaks if \w+ is used directly. So instead we match every run of characters that are not whitespace, # or *. I know this regex may be incomplete, or may need more excluded characters to filter properly, but for now it is fine, as we need approximate results, not exact ones.


Finally, using the above regex and filtering out extra spaces, we can get an approximate count of all the words present in our notes, which we can later use to generate the read time for each note.
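
To show where the word count is headed, here is a small sketch of turning it into a read time. The reading speed of 200 words per minute, the rounding up, and the one-minute minimum are all assumptions of mine for illustration; the note does not specify them:

```rust
/// Assumed average reading speed; this value is not taken from the note.
const WORDS_PER_MINUTE: usize = 200;

/// Estimates the read time in whole minutes, rounding up,
/// with a minimum of one minute.
fn read_time_minutes(word_count: usize) -> usize {
    // Integer ceiling division: (a + b - 1) / b.
    let minutes = (word_count + WORDS_PER_MINUTE - 1) / WORDS_PER_MINUTE;
    minutes.max(1)
}
```

Since the word count itself is only approximate, rounding to whole minutes loses nothing that matters.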


You can find the full code for this note here.


© Copyright 2020 Subroto Biswas
