Python’s Beautiful Soup

In my last post, I more or less just complained about what a dumpster fire developing anything for Twitter is. Originally, though, the post was intended to be about what I was developing for Twitter. It’s nothing amazing or even complicated, but it was a fun learning experience… Twitter itself notwithstanding. I made a Twitter bot that tweets a different shade of pink each day. Since that’ll most likely seem nonsensical to most people, let me explain.

Background

A little over a year ago, a friend and I started a podcast. I won’t go into the backstory of why we named it what we did, but the name revolved around the color pink due to an inside joke between my friend and me. We ended up publishing 21 episodes in the span of a year before we decided to stop. I had moved about an hour away from where I previously lived for a new job, so recording in person involved a decent bit of travel for one of us. Then the coronavirus pandemic really started to take off in my country, and given what a dumpster fire trying to record a podcast remotely is, my friend and I jointly decided to shutter the podcast. It was a fun experience, but not something either of us really wanted to keep putting time and money into. As is typically the case, we reached this conclusion just a month after the hosting for both the podcast and our website renewed. Go figure.

That being said, we had set up social media for the podcast, and that social media was now doing exactly nothing. While I didn’t want to do anything with the Facebook or Instagram accounts that my co-host ran (you couldn’t pay me to touch a Facebook property), I thought about what I could do with the lingering Twitter account. I eventually decided to make a simple bot that would tweet a different shade of the color pink each day.

Python and Beautiful Soup

As I mentioned previously, the actual code to post to Twitter ended up being extremely simple. I just used the Twython library to do the heavy lifting. What ended up being more interesting was how to create the database of colors I would use. After all, I don’t personally know that many different shades of pink, and I wanted to include the RGB and hex color codes for each shade in the daily post. I basically needed a repository of shades of pink. After some DuckDuckGo-fu, I eventually found a page that included not just the RGB and hex color codes, but also a name for each shade. It was exactly what I needed.
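Before getting to the scraping, here’s a rough sketch of just how little the posting side involves with Twython. The credentials and status text below are placeholders rather than the exact values or variable names from my script:

from twython import Twython

# Placeholder credentials generated in the Twitter developer portal.
APP_KEY = "..."
APP_SECRET = "..."
OAUTH_TOKEN = "..."
OAUTH_TOKEN_SECRET = "..."

# Authenticate and post a single status update.
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.update_status(status="Today's shade of pink: ...")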

The only problem was how to get the information from that page into something I could use in my script for the bot. My immediate thought was to copy and paste all of the information, but along with being error-prone over hundreds of shades, that’s also insanely tedious. In a shell script, something like xmllint would fit the bill. Since I was already working in Python, though, I decided to use Beautiful Soup. I had actually used Beautiful Soup once before, on a project years ago, where I admittedly didn’t really know Python and most definitely didn’t understand what I was doing with Beautiful Soup; I just ended up copying and pasting a bunch of code from the Internet until things worked the way I wanted.

This time, I took just a little time to read the documentation for Beautiful Soup and understand what I was actually doing. The crux of my script comes down to:

divisions = soup.find_all("div", {"class": "color-inner"})

This gets me each of the div groupings for a color. With each of those groupings defined, it was then simple to get the name, hex, and RGB information I needed:

for division in divisions:
    color_name = division.find("span", {"class": "color-sub"}).get_text()
    color_hex = division.find("span", {"class": "color-id"}).get_text()
    color_rgb = division.find("span", {"class": "color-rgb"}).get_text()
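For context, the soup object in those snippets is just Beautiful Soup’s parse of that page’s HTML. A minimal sketch of the setup, assuming the page is fetched with requests and using a placeholder URL rather than the real one:

import requests
from bs4 import BeautifulSoup

# Placeholder URL standing in for the page of pink shades I found.
page = requests.get("https://example.com/shades-of-pink")
soup = BeautifulSoup(page.text, "html.parser")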

Instead of trying to copy everything by hand, I had a working script to get all of the colors without needing to worry about human error. Plus, if the source website adds any new colors, it’s trivial to re-run the script and get an updated list. I ended up making a map for each color and adding all of the maps to a list.

all_colors.append({"name": color_name, "hex": color_hex, "rgb": color_rgb})

Then I wrapped it all up at the end by exporting the list of maps to a JSON file.

with open('pinks.json', 'w') as outfile:
    json.dump(all_colors, outfile)

My other script, which actually pushes the post to Twitter, ingests this JSON file and then selects a random shade from it.
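That selection step amounts to just a few lines as well. A rough sketch, assuming the pinks.json structure shown above; the resulting status string is what gets handed to the update_status call from earlier:

import json
import random

# Load the repository of shades produced by the scraping script.
with open("pinks.json", "r") as infile:
    all_colors = json.load(infile)

# Pick a random shade and build the text for the day's post.
shade = random.choice(all_colors)
status = f"Today's shade of pink is {shade['name']}: hex {shade['hex']}, RGB {shade['rgb']}"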

Twitter Still Sucks

As if further proof were needed that Twitter is garbage, though, I found myself simultaneously amused and irritated just a few days ago when I saw that a daily post had not been completed. When I logged into the account for the bot, I received a notification that the account had been flagged for “suspicious activity”, and I had to walk through a verification process before the account could post again. It’s amazing to me that a platform which tolerates the most hateful and dangerous rhetoric chooses to flag a clear bot that makes a single post each day with details on a different shade of the color pink as “suspicious.” It’s just further proof that Twitter really isn’t worth anyone’s time at this point.

My latest project, though, involves pushing data to Mastodon instead of Twitter. This post will serve as the first test of it, so assuming everything works, look for a post on that in the near future.

Twitter Development Impressions

I recently had an idea to turn the Twitter account for my defunct podcast into a Twitter bot posting a new shade of pink every day; it makes sense because the podcast and Twitter account were centered around the color pink. I didn’t really think anyone would care about this particular bot, but it seemed like a fun project idea to work on. It ended up being an interesting learning experience, but not at all for the reasons I actually expected.

Developer Account

Getting a Twitter developer account can be either really simple or really irritating, and there’s no discernible difference that dictates which experience you’ll get. I went to the Developer portal and registered for a developer account with my normal Twitter account. This account is clearly me IRL; my name is in it, I have a photo of myself on the account, it links to my personal website, and the post history clearly indicates that the account is a person rather than a bot. As part of registering for the account, I had to describe what I was going to create. I honestly stated that I was just going to create a bot that would tweet a shade of pink each day. Twitter asks questions such as whether you plan to export data out of the service, whether you plan to display information posted to Twitter outside of Twitter, etc. I answered “no” to all of these questions since I wasn’t pulling any data out of the service. I just needed to post.

After I completed the registration form, I received a message that my account was under review and I would receive a notification when that review was completed. I was a little bummed since it happened to be a long weekend for me, and I was hoping that this project would give me something to fill the time. I was hopeful maybe the review would be completed quickly. I was wrong. It took just shy of two weeks before the review was completed. I had almost forgotten about the whole thing since I’ve been trying to stay off of Twitter as of late, but then I got an email telling me I was allowed in. Wild.

Authentication

Handling authentication with Mastodon is a relatively simple, straightforward process if you’ve done this sort of thing with… pretty much any API. It’s a little different from the type of thing I do for work, since creating a client means people other than the person writing the code will be authenticating, but it still makes sense and is well documented. On the other hand, authenticating through Twitter is a complete nightmare. Outside of the specifics of authentication, everything in Twitter’s documentation seems aimed at keeping each individual page as short as possible. As a result, every page links to numerous other pages, and you end up with dozens of browser tabs open just to have some clue as to what your complete workflow looks like. For the OAuth 1.0a option, which is the option to use if you need an account other than the registered developer account to leverage an application, they recommend strongly against making the JWT yourself in favor of leveraging a library… but they don’t actually share any of the particulars about their JWT setup… or even call it a JWT. You very clearly get the impression that Twitter doesn’t want anyone actually using their API. Crazy.

After seeing the poor documentation, I abandoned my ideas of making my bot in Rust or maybe Bash, and instead just decided to use Python with the twython library. I’ve manually parsed together enough JWTs that I didn’t think I cared enough to do it again for this. Seeing the workflow for twython showcased the next bit of crazy, though, which is that the authentication workflow sends the OAuth token to the callback URL. I basically needed to set up something completely different with an HTTP listener for the OAuth token so that I could move forward with authentication. That was entirely more than I wanted to put into this simple bot.
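For reference, the 3-legged flow that twython documents looks roughly like the sketch below (the credentials and URLs are placeholders). The verifier only ever arrives at the callback URL, which is why something has to be listening there:

from twython import Twython

# Placeholder application credentials from the developer portal.
APP_KEY = "..."
APP_SECRET = "..."

# Step 1: get temporary credentials and the URL to send the user to.
twitter = Twython(APP_KEY, APP_SECRET)
auth = twitter.get_authentication_tokens(callback_url="https://example.com/callback")

# Step 2: the user approves the app at auth["auth_url"], and Twitter
# redirects them to the callback URL with an oauth_verifier attached.
oauth_verifier = "..."  # captured by whatever is listening at the callback

# Step 3: exchange the temporary credentials for the final OAuth tokens.
twitter = Twython(APP_KEY, APP_SECRET, auth["oauth_token"], auth["oauth_token_secret"])
final_tokens = twitter.get_authorized_tokens(oauth_verifier)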

Note: The craziness of this makes me still think I’m not actually understanding the setup properly. I did verify, though, that the response received from where I was running the code included just the HTTP status, so the information is not coming back to the sender by default. Likewise, I couldn’t open my application up to other users without giving a callback URL, so omitting that wasn’t an option, either. Hit me up on Mastodon if I’m just dumb and there’s a reasonable way to handle this that I’m misunderstanding.

Registering Another Developer Account

At this point I realized that I really should have just registered for a developer account with the account I was planning to use with the bot. I started that process again fully expecting to wait another two weeks. I filled out all of the same information during the registration process, but this time when I completed the form I was immediately kicked over to the developer portal to start working on whatever I needed.

I’m still amazed that for my actual account, the one I use as a human being, I had to wait two weeks to get developer access, yet when I registered with the account that had been used for my podcast, I was allowed access sans review. The podcast Twitter account generally had nothing posted to it other than the automatic posts from Squarespace, had the podcast logo as the profile picture, had no website linked, and quite clearly was not a person… yet that’s the account that got in right away. Okie dokie!

Wrap-Up

Since I now had the account I was planning to post from in the developer portal, I could spin up my OAuth token directly from there in my browser as opposed to having to leverage 3-legged OAuth to a callback URL. As I started working on the code, though, I quickly realized that the script for the bot to post was going to be insanely easy. Instead, the much more interesting part of the code was the script I wrote to make a little local repository of information on shades of pink that was completely unrelated to Twitter. I had originally planned to cover that in this post, but since this ended up being longer than I expected that’s what I’ll cover next time around.

Deciphering the Office 365 PSTN Usage Report

Background

Last week I received a bit of a mystifying alert from Office 365 letting me know that our pool of PSTN dial-out minutes for the month had been 80% consumed. This was mystifying because, as someone who has managed O365 for nearly a decade, I didn’t realize there was a pool of minutes. PSTN (Public Switched Telephone Network) had to be related in some way or another to the capability my company has purchased from Office 365 to have a dial-in option added to our Microsoft Teams meetings. Any meeting we create has a dial-in number so that anyone who has a poor data connection can still dial a telephone and at least join the audio portion of the meeting.

While the main function of the Audioconferencing license is to provide that dial-in functionality, dial-out options also exist. For example, if you realize mid-meeting that you need someone else to participate, you have the option to have Teams dial that person if you provide a phone number. Likewise, if you’re struggling in a low-coverage area, you can tap a button in the Teams mobile app to have the service call you.

Sure enough (say what you want about Microsoft, but their documentation for Office 365 tends to be extremely good), I found the official confirmation. It just seems like a bit of a forgotten caveat considering you have to go to the legacy Skype portal to see it; the numbers are completely absent from the Teams portal:

You can monitor the usage against your dial-out minute pool in the “legacy” Skype for Business Admin Center. In the Microsoft Teams Admin Center, navigate to Legacy portal > Reports > PSTN Minute Pools. The Zone A dial-out minute pool will be labeled in the report as “Outbound Calls to Zone A Countries.”

Digging In

When I checked the total number of minutes we had in our pool, I did some quick math to confirm that we were allotted 60 dial-out minutes per assigned Audioconferencing license (so, for example, 50 assigned licenses would put 3,000 dial-out minutes in the pool); purchased licenses that were not yet assigned to a user didn’t factor into the equation.

I was still a bit stumped as to who would be eating through our minutes, though, considering I figured dial-out to be a pretty niche feature. Luckily, the reporting section of the legacy Skype portal gives a report of how the minutes are used. There’s even an option to export it for a particular time range. I exported the month-to-date data for my tenant, which saved a .csv file. Opening it up showed me that there was still some work to do. The report listed dates for each call, the UPN (userPrincipalName) of the user, a source number, a destination number, and the duration of the call… in seconds. It also included both the dial-out information I wanted and the dial-in information that I don’t care about since that is unlimited. Ick.

I first fired up my text editor and put together a Python script. It weeded out the dial-in entries and then tallied the dial-out entries per user in a list of dictionaries, one dictionary for each user, where the key was the UPN and the value was the total call duration in seconds. For each dial-out entry in the report, the script checked the list: if the user already had a dictionary there, the duration of that call was added to the user’s existing total; if not, a new dictionary was appended with the duration of the current call as the starting point. Each dictionary in the list only needed that one key-value pair.

I’m not sharing this initial version of the script because while the code worked perfectly, the logic was flawed. The report actually showed that Craft Brew Geek was using significantly more minutes than anyone else. While I was aware that he was frequently using his phone for Teams meetings during quarantine because his fixed-line provider was having issues, I thought he was doing that via LTE rather than PSTN. He described his method of joining meetings to me, and he was literally tapping the “Join” button from the Teams app on his phone. To verify this was using LTE data, I noted the LTE data consumed by the Teams app on my phone for the month, did a 10-minute Teams audio-only call with Brandi by joining the same way Craft Brew Geek was, and then checked the data usage again. Sure enough, it went up by 10 MB; this process wasn’t touching PSTN at all. I made doubly sure by assigning a policy to just my O365 account in Teams to block dial-out and verifying that tapping the “Join” button on my phone still worked fine while the explicit “Dial me” option ended in an error.

The report also struck me as odd because it only listed two days for Craft Brew Geek, despite the fact that I know he’s been joining Teams calls the same way for two months now while we all work from home. I went back to the data, this time paying attention to the source and destination numbers listed in the report. I noticed that for all of Craft Brew Geek’s entries, the number listed as the source was the number from Microsoft that we selected as the “default” number for our tenant due to it having the closest proximity to the majority of our employees. The destination number was the same every time as well, but it wasn’t Craft Brew Geek’s number. It was the number for a different employee in our company.

This is when I finally realized that the UPN listed in the report is not necessarily the UPN of the user who consumed the minutes; that wouldn’t even necessarily make sense considering you could have a meeting dial out to an external participant (e.g., if you were using Teams to conduct an interview), or you may not have any telephone information stored for your users in Office 365. Instead, the UPN listed in the report is whoever scheduled the meeting.

Realizing that my logic was flawed due to the ambiguity of the report, I modified my script and replaced the UPN as the key in each dictionary with the destination telephone number. What I ended up with is this Python script. It’s quick and dirty since I wasn’t trying anything fancy; I just wanted a quick, sorted listing of our dial-out usage.

import csv


# Running total of dial-out seconds, keyed by destination phone number.
usage_tracker = {}

report_file = "./pstn_report.csv"
with open(report_file, "r", newline="") as csvfile:
    reader = csv.DictReader(csvfile)

    # Only dial-out ("conf_out") entries count against the minute pool.
    for row in reader:
        if row["Call Type"] == "conf_out":
            seconds = int(row["Duration Seconds"])
            if row["Destination Number"] in usage_tracker:
                usage_tracker[row["Destination Number"]] += seconds
            else:
                usage_tracker[row["Destination Number"]] = seconds

# Sort the destination numbers by total usage, highest first.
usage_tuple_list = sorted(usage_tracker.items(), key=lambda x: x[1], reverse=True)

total_minutes = 0
for number, seconds in usage_tuple_list:
    current_minutes = round(seconds / 60)
    total_minutes += current_minutes
    print(f"{number}: {current_minutes}")

print("============================")
print(total_minutes)

At this point, all I needed to do was figure out who the telephone numbers my script spit out belonged to. Our company maintains a listing of numbers for each employee, so it was simple for me to search it for each entry. This showed us that one employee who was using dial-out heavily for himself as a matter of convenience during the quarantine was using basically all of our minutes.

Aftermath

The main takeaways from this incident are:

  1. Be mindful of the PSTN dial-out pool in Office 365 if you’re doing audioconferencing. I hadn’t thought of it because the O365 instance I managed previously was absolutely massive; since the size of the pool is based on license assignment, that meant significantly more minutes in the pool for a feature most people never use. Keep tabs on it in a smaller organization, especially when everyone is suddenly working remotely.
  2. The report from the legacy Skype admin center is pretty confusing, and it’s easy to mistake who is consuming dial-out minutes. Keep tabs on the destination number(s) in the report to find out who is actually doing it.

That’s it! Stay pink (from a socially acceptable distance, of course).

Sorting IP Addresses With PowerShell

I was recently attempting to sort a list of IP addresses that I had in a text file. PowerShell is awesome at sorting, so I figured I would use it given that I had nearly 2000 of them. I initially just tried to pipe the contents of the file to the Sort-Object cmdlet:

Get-Content -Path .\ipListFixed.txt | Sort-Object

The results were, suffice it to say, lackluster. I ended up with something like this:

10.1.69.1
10.1.69.12
10.1.69.148
10.1.69.149
10.1.69.17

Gross, right? Obviously 10.1.69.17 shouldn’t come after 10.1.69.148 or .149. The issue is that the final octet isn’t being considered as a whole number. The values are being sorted as plain strings rather than as IP addresses: comparing character by character, 7 sorts after 4, so PowerShell never actually compares 17 and 148 as numbers to realize that 17 is smaller.

It makes sense; PowerShell can’t assume that the value is an IP address, so it treats the value like a string instead. The key, then, is to give PowerShell a type it can compare numerically, octet by octet, and casting the values to the .NET Version class does exactly that because its four parts line up with an IPv4 address’s four octets. Since I happened to be pulling the IPs from a file, I did the casting within the -Property parameter that I added to the Sort-Object cmdlet:

Get-Content -Path .\ipListFixed.txt | Sort-Object -Property { [System.Version]$_ }

This yields significantly better results:

10.1.69.1
10.1.69.2
10.1.69.12
10.1.69.17
10.1.69.29

Stay pink!

PowerShell: Export Non-Standard and Multi-Value AD Attributes to CSV

I’ve recently been spending a lot of my time at work doing PowerShell scripting for a project that’s currently underway. While working on a bit of code late last week I ran into two interesting problems that I had yet to really stumble across despite doing heavy PowerShell scripting against Active Directory for nearly a decade now.

Exporting Non-standard Attributes To CSV

My first issue was that I needed to export some non-standard attributes from contact objects in AD to a .csv file. When I say “non-standard” I don’t mean that we’ve done a custom schema extension or anything like that. Instead, I simply mean that the AD PowerShell module only wants to include certain attributes when exporting AD objects to a .csv file. For example, if I run Get-ADObject -Properties * against a particular contact object and simply let the output dump to my shell, I see all of the attributes I expect, including stuff like givenName and sn. However, if I run the exact same cmdlet but pipe the result to Export-Csv rather than sending it to the shell, I will not have all of the same attributes; ones like givenName and sn will now be missing. That’s a problem.

Storing Multi-Value Attributes In a CSV

The other problem was that one of the attributes I needed is proxyAddresses. This is a multi-value attribute, as contacts could very likely have multiple proxy addresses associated with a particular mailbox. They may have an alias, x500 addresses, etc. If you use PowerShell to pull this information from AD and dump it to the screen, you’ll see the expected proxy address information. If you call GetType() on the proxyAddresses attribute, you’ll see it listed as an array. That’s why, if you export a contact to a .csv file and include proxyAddresses, the value in that particular column for each contact will simply read:

Microsoft.ActiveDirectory.Management.ADPropertyValueCollection

PowerShell isn’t sure how to flatten a multi-value attribute into a single .csv cell, so it stores the name of the data type instead of the values. This is also a problem!

The Solution

My solution to this was the following. Note that it’ll be easier to view the Gist for it.

# Get the objects with the required attributes and store proxyAddresses properly in the .csv file.
Get-ADObject -Filter { objectClass -eq "contact" } -Server myserver.mydomain.net -SearchBase "OU=Contacts,DC=mydomain,DC=net" -SearchScope OneLevel -Properties name,givenName,sn,displayName,telephoneNumber,proxyAddresses,targetAddress,mail,mailNickname,company,department,l,physicalDeliveryOfficeName,postalCode,st,streetAddress,title |
    Select-Object name,givenName,sn,displayName,telephoneNumber,targetAddress,mail,mailNickname,company,department,l,physicalDeliveryOfficeName,postalCode,st,streetAddress,title,@{name="proxyAddresses"; expression={$_.proxyAddresses -join ";"}} |
    Export-Csv -Path .\sample.csv -Encoding ASCII -Append -NoClobber -NoTypeInformation

# Shows how to then use the proxyAddresses attribute in the future.
$allContacts = Import-Csv -Path .\sample.csv
foreach($singleContact in $allContacts) {
    $proxyAddresses = $singleContact.proxyAddresses.Split(";")
}

The first thing I needed was to get the properties I actually needed as opposed to just the ones PowerShell felt like including due to the object type. The solution was to:

  1. Specify all of the attributes I wanted in -Properties to ensure they would be available.

  2. Pipe the AD object to Select-Object and then re-specify each of the properties I wanted.

This works because PowerShell is selecting the attributes to include in the .csv file based on the type of object. Select-Object takes whatever comes through the pipeline, though, and makes a brand new, generic object from it. That object will include any properties from the original object that you specify, which is why I tell it the exact listing of properties that I need again.

This leads to the solution to the second problem, which is what to do for proxyAddresses. PowerShell allows for the use of a calculated property, an expression that builds a new property. In this instance the expression takes the existing proxyAddresses array and uses -join to collapse it into a single string, with each item separated by a semicolon. If a contact has the following proxy addresses, for example:

SMTP:pretty@unusually.pink
smtp:awesome@unusually.pink

Then the property in the .csv file will be stored as:

SMTP:pretty@unusually.pink;smtp:awesome@unusually.pink

You could pick any character for the delimiter that you want as long as it isn’t used in other data that you’re storing.

In this instance my goal was to take the exported data, re-import it, and then create contacts in a different AD forest. When creating a new object, AD expects an array for the proxyAddresses attribute; a string with an arbitrary delimiter won’t do. That’s where the second part of the example comes into play. While looping through all of the data stored in the .csv file, I can recreate the array of proxy addresses by splitting the string on my delimiter (in this case a semicolon) and assigning the result to a variable. That variable will be an array, which can then be used to populate proxyAddresses when running New-ADObject.