Lessons from building a Carousell Alerting System

Zing Zai
May 7, 2020

1) TLDR lessons;

1 minute impression

I found the DIY version too long… It will take an average user around 10–15 minutes to create a Telegram bot and host the alerter on Heroku.

With the goal of letting users receive alerts immediately, I created a Telegram bot (@caroualert_bot) and integrated it with the Cron job below. Users can add items to monitor and @caroualerter_bot will send the updates.

2) DIY lessons;

Puppeteer

Using Puppeteer and Cron

What is Puppeteer and what I learnt?

Puppeteer is a Node.js library maintained by Google’s Chrome team.
It allows us to use its APIs to interact with Chrome DevTools, headlessly!

Apparently, browsers have heads (the UI), and headless browsers have much faster performance. I learnt that we could interact with Chrome DevTools without even opening a browser window. Cool stuff, shallow learning curve.
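To make "headless" concrete, here is a minimal sketch that opens a page without any visible window and reads its title. The function name `fetchTitle` is hypothetical, and the `puppeteer` argument is assumed to be the module itself, passed in by the caller.

```javascript
// `puppeteer` is the module itself, e.g. `require('puppeteer')`, passed in by
// the caller so that the sketch does not hard-code how it is installed.
async function fetchTitle(puppeteer, url) {
  // headless: true means Chrome runs without any visible window.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}
```

A call would look something like `fetchTitle(require('puppeteer'), 'https://example.com')`.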

What is Cron and what I learnt?

Cron is a JavaScript tool that allows us to run a job on a schedule. The schedule is written in cron syntax.

0 0 0/1 1/1 * ? *
| | |   |   | | |
| | |   |   | | +-- Year (range: 1970-2099)
| | |   |   | +---- Day of the Week (range: 1-7 or SUN-SAT)
| | |   |   +------ Month of the Year (range: 0-11 or JAN-DEC)
| | |   +---------- Day of the Month (range: 1-31)
| | +-------------- Hour (range: 0-23)
| +---------------- Minute (range: 0-59)
+------------------ Second (range: 0-59)

I found the Cron syntax really confusing, but then came across this great explanation by Quartz. With this understanding, I was able to better control when my Cron jobs run. Besides Unix's crontab, I've seen the syntax being used in NiFi, so it might be useful to take some time to learn how it works.
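To make the seven fields concrete, here is a minimal sketch that labels each field of a Quartz-style expression. The helper name `describeCron` and the field labels are my own, for illustration only; it splits and labels the expression but does not validate ranges or schedule anything.

```javascript
// Field names in the order a Quartz-style expression lists them.
const FIELD_NAMES = [
  'second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek', 'year',
];

// Hypothetical helper: maps each field of the expression to its name.
// The year field is optional in Quartz, so 6 or 7 fields are accepted.
function describeCron(expression) {
  const fields = expression.trim().split(/\s+/);
  if (fields.length < 6 || fields.length > 7) {
    throw new Error(`Expected 6 or 7 fields, got ${fields.length}`);
  }
  const labelled = {};
  fields.forEach((value, i) => {
    labelled[FIELD_NAMES[i]] = value;
  });
  return labelled;
}

console.log(describeCron('0 0 0/1 1/1 * ? *'));
```

Read this way, `0 0 0/1 1/1 * ? *` fires at second 0 and minute 0 of every hour, every day.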

Before using Puppeteer…

Before Puppeteer, I actually wrote a simple Python bot with Requests, BeautifulSoup4 (BS) and Pandas. Requests returns HTML content, BS parses the content and Pandas manipulates the content.
RECOMMEND: First time with Pandas’ dataframes. Powerful; Easy to learn;

Insufficient info

However, upon more careful investigation, I realised that the first response of the Carousell listing page only returned very basic info (i.e. listing name, seller’s name). I wanted more info such as image links so that I do not have to open the app manually when I see an update. Telegram will process the URL link for me and I’ll be able to see the image instantly.

Difficult to traverse HTML and subject to change

I find it troublesome to traverse HTML elements, don't you? Also, HTML elements are more prone to change than JavaScript objects. I would prefer to work with an object that is well-structured, such as the browser window object.

SearchListing object from window.initialState

With the limitations above, I needed another strategy. With a little more analysis work, I found out that the window.initialState object has all the info I needed and went with Puppeteer.
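Reading that object could look roughly like the sketch below. The helper name `readInitialState` is hypothetical, `page` is assumed to be a Puppeteer Page created by the caller, and nothing is assumed about the shape of the returned object.

```javascript
// Hypothetical helper: `page` is expected to be a Puppeteer Page created by
// the caller, e.g. via `const page = await browser.newPage()`.
async function readInitialState(page, url) {
  // Assumption: the state is set by the document itself, so waiting for the
  // DOM to be parsed is enough.
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  // The callback runs inside the browser, where `window` is the page's window.
  const state = await page.evaluate(() => window.initialState);
  if (!state) {
    throw new Error('window.initialState was not found on ' + url);
  }
  return state;
}
```

Because `page.evaluate` returns a plain serialised copy of the object, the rest of the alerter can work with ordinary JavaScript data instead of HTML selectors.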
If you’re wondering why Puppeteer, not Selenium… at one point, you’ll have to stop comparing, decide on one and go with it. Re-evaluate it after some time.

Deploying onto Heroku

Not everyone has the time to develop an application, so I wanted it to be a one-click deployment. No fuss, no hassle: anyone can use it almost immediately.

This time, I studied and applied more configs in app.json:
1) Env configs: Unlike my previous Heroku deployments where I needed users to manually go into the Settings to add their config vars, I prompt them at the deployment stage.
2) Buildpacks: I had similar issues with deploying Puppeteer onto Heroku Stack 18 (latest). I needed to downgrade the Heroku Stack to 16 and include a custom buildpack.
3) Formation: By default, Heroku spins up a web dyno. I wanted to use a worker dyno, which is meant for background jobs and queues.
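Put together, the three configs above might look like the fragment below. This is only a sketch: the app name, the `TELEGRAM_BOT_TOKEN` variable and the choice of Puppeteer buildpack are assumptions for illustration, not necessarily what this project uses.

```json
{
  "name": "carousell-alerter",
  "stack": "heroku-16",
  "env": {
    "TELEGRAM_BOT_TOKEN": {
      "description": "Token of the Telegram bot that sends the alerts",
      "required": true
    }
  },
  "buildpacks": [
    { "url": "heroku/nodejs" },
    { "url": "https://github.com/jontewks/puppeteer-heroku-buildpack" }
  ],
  "formation": {
    "worker": { "quantity": 1 }
  }
}
```

Each entry under `env` becomes a prompt on the deployment page, and `formation` tells Heroku to start a worker dyno.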

Improvements

Optimised Performance

By default, Puppeteer loads every resource a page requests: JavaScript, stylesheets, media, fonts, etc. I optimised this to retrieve only document-type resources and abort all other requests, which brought the load time down to ~1 sec.
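A minimal sketch of this kind of filtering with Puppeteer's request interception is shown below. The function names are hypothetical and `page` is assumed to be a Puppeteer Page supplied by the caller.

```javascript
// Decides what to do with one intercepted request: only the HTML document is
// allowed through; scripts, stylesheets, images, fonts, etc. are aborted.
function handleRequest(request) {
  if (request.resourceType() === 'document') {
    return request.continue();
  }
  return request.abort();
}

// Hypothetical wiring: `page` is a Puppeteer Page supplied by the caller.
// Interception must be enabled before the 'request' events can be handled.
async function blockNonDocumentRequests(page) {
  await page.setRequestInterception(true);
  page.on('request', handleRequest);
}
```

`blockNonDocumentRequests(page)` needs to be called before `page.goto(...)`, otherwise the first navigation still loads everything.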

Improved UX

Instead of doing a manual search and then copying and pasting the URL, users now just have to enter the search term. Encoding and parsing are done automatically for them.
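The encoding step can be sketched as follows. The base URL and the `/search/<term>` path are assumptions about how the site structures its search pages, and `buildSearchUrl` is a hypothetical name.

```javascript
// Assumption: the search page lives at <base>/search/<encoded term>.
// Check the URL the site actually produces for a manual search.
const SEARCH_BASE = 'https://www.carousell.sg/search/';

function buildSearchUrl(term) {
  // Collapse repeated whitespace so "nintendo   switch" and "nintendo switch"
  // produce the same URL.
  const cleaned = term.trim().replace(/\s+/g, ' ');
  if (cleaned === '') {
    throw new Error('Search term must not be empty');
  }
  // encodeURIComponent escapes spaces, '&', '/' and other reserved characters.
  return SEARCH_BASE + encodeURIComponent(cleaned);
}

console.log(buildSearchUrl('nintendo switch'));
// prints "https://www.carousell.sg/search/nintendo%20switch"
```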

Narrowed search listings

Unrelated listings from advertisements popped up after a while. Now, when Puppeteer launches, the cache is disabled. New sessions do not have advertisements in them, so we are quite safe.

After some monitoring, I noticed that resellers' advertisements still appear on first load. We now automatically filter out bumpers and spotlighters.
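The filtering idea can be sketched like this. The listing shape and the `promotionType` field are hypothetical; the real objects may mark bumped and spotlighted listings differently.

```javascript
// Hypothetical listing shape: { title, promotionType }, where promotionType is
// 'bump', 'spotlight' or undefined. The real field names may differ.
const PROMOTED_TYPES = new Set(['bump', 'spotlight']);

// Keeps only organic listings, dropping bumped and spotlighted ones.
function removePromoted(listings) {
  return listings.filter((listing) => !PROMOTED_TYPES.has(listing.promotionType));
}

const sample = [
  { title: 'Switch console', promotionType: 'bump' },
  { title: 'Switch Lite' },
  { title: 'Switch dock', promotionType: 'spotlight' },
];
console.log(removePromoted(sample).map((listing) => listing.title));
// prints [ 'Switch Lite' ]
```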

Migrated from mLab to MongoDB Atlas

“mLab is shutting down its Heroku add-on on November 10, 2020. You will lose access to your data unless you detach from Heroku or migrate to MongoDB Atlas before then.”

29 Jul 2020: This guide made it really easy to migrate from mLab to Atlas!

What’s Next…

I also learnt that there wasn’t a need for a big-ass plan. Things naturally fall into place and I started to discover where gaps were.

Data Analytics

I relied on Telegram’s search function (Yeppie to not needing to integrate a search engine myself!) to search for past listings to answer questions such as:
1. What is the market price for this product? Am I under/over-paying?
Am I getting it at a bargain? (Impetus for this bot)
2. How popular is the listed product? (Difficult question to answer with functions of the original site as I wasn’t able to search for SOLD, DELETED)

Despite being able to answer the above questions using Telegram, I'm still analysing things at a micro level. Leveraging data analytics would piece things together and give users a better picture of the things they are monitoring.
