What Can Website Categorization Tell Us About Malicious URLs?

The saying “Change is Inevitable” is very apt for the World Wide Web. The Web is constantly evolving, with changes happening in real time: new URLs are created, and existing URLs are modified or deleted. In this digital age, the Web has become an indispensable part of a user’s life, since people now conduct most of their affairs online, be it banking, e-commerce, learning, social networking and the like. A strong online presence has therefore become imperative for success in one’s endeavours.

While the Web has become the pre-eminent application on the Internet, it comes with a colossal share of cyber attacks. Adversaries have long used the Web as a medium to deliver malware through spamming and phishing, among other methods. AV vendors typically classify a URL as malicious if it belongs to a website created by the threat actors themselves, to a legitimate website that has been compromised, or to a website whose hosting features have been abused to host malicious content. Out of curiosity, K7 Labs researchers decided to perform a cursory investigation of whether it is possible to obtain clues about the nature of such malicious URLs by forcing an evaluation of them by K7’s Machine Learning-based Web Categorization feature.

Note that K7’s Web Categorization feature uses various attributes of a website, such as its content, to make automated, intelligent and contextual decisions about its likely purpose, so that admins can decide to block or allow classes of websites for employees, e.g. block adult content, sports and shopping but allow news. In a household, parents are the admins who can use the same feature to control what their kids are allowed to access.

In this blog we provide a gist of our findings with respect to known malicious sites categorized by K7’s ML-based Web Categorization system (K7 WebCat).

Here, we have broadly classified the known malicious URLs into “Malware” and “Phishing” categories. Within each of these, we have further grouped the URLs based on the result returned by K7 WebCat.
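The two-level grouping described above can be sketched in a few lines of Python. This is only an illustration: the sample records, hostnames and function name are invented for the example, and the code does not represent K7 WebCat’s actual interface or data.

```python
from collections import Counter, defaultdict

# Hypothetical sample: (url, known verdict, category returned by a categorizer).
# These records are made up for illustration; they are not K7 data.
SAMPLES = [
    ("http://a.example/x", "Malware", "Unknown"),
    ("http://b.example/y", "Malware", "Shopping"),
    ("http://c.example/z", "Phishing", "Unknown"),
    ("http://d.example/w", "Phishing", "Financial Services"),
    ("http://e.example/v", "Phishing", "Unknown"),
]


def group_by_verdict_and_category(samples):
    """Count URLs per sub-category within each broad verdict."""
    counts = defaultdict(Counter)
    for _url, verdict, category in samples:
        counts[verdict][category] += 1
    return counts


if __name__ == "__main__":
    grouped = group_by_verdict_and_category(SAMPLES)
    for verdict, per_category in grouped.items():
        # most_common() lists the largest sub-category first.
        for category, n in per_category.most_common():
            print(f"{verdict:8s} {category:20s} {n}")
```

Keeping one counter per verdict makes it easy to compare, for instance, how large the “Unknown” share is for Malware versus Phishing.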

Need for URL Categorization Based Access

Category-based access control is needed not only by organizations but also by consumers, for example for parental control, to restrict access to certain types of websites, e.g. sites hosting inappropriate content, news or shopping. As mentioned earlier, an admin can restrict employees from browsing inappropriate content on the corporate network, and parents can block access to improper content for their children. In the corporate context, this also helps the organization ensure that employees do not waste time, computer resources and bandwidth on sites that are not productive to the organization, such as social networking and sports.
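As a rough illustration of category-based access, the following Python sketch applies a block list of categories to a hostname. Everything here is an assumption made for the example: the category names echo those mentioned in this blog, the lookup table stands in for a real categorization service, and the function name is hypothetical.

```python
# Hypothetical policy: categories an admin has chosen to block.
BLOCKED_CATEGORIES = {"Adult Sites", "Sports", "Shopping", "Social Networking"}

# Stand-in for a real categorization service (invented hostnames).
CATEGORY_LOOKUP = {
    "news.example": "News",
    "shop.example": "Shopping",
}


def is_allowed(hostname, blocked=BLOCKED_CATEGORIES, block_unknown=False):
    """Return True if the policy permits access to the hostname."""
    category = CATEGORY_LOOKUP.get(hostname.lower(), "Unknown")
    if category == "Unknown":
        # Uncategorized sites are a policy decision: allow or block.
        return not block_unknown
    return category not in blocked


if __name__ == "__main__":
    for host in ("news.example", "shop.example", "new-site.example"):
        print(host, "allowed" if is_allowed(host) else "blocked")
```

The `block_unknown` switch reflects a design choice an admin has to make: blocking uncategorized sites is safer but may also block harmless new sites.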

Malware

A malware site is typically one that hosts malicious content. Users usually reach these malicious websites by accident or are lured to them by clicking on links in a spam email, SMS, social media post, etc. One common cause of accidental visits is typosquatting, where a fraudulent site is registered under a misspelling of a popular domain name. For instance, a typosquatting variant of Google is ‘Goggle.com’, which is a fraud site. Another important factor is that many of these malicious sites look absolutely legitimate and even use supposedly secure HTTPS, so they may not raise even an iota of doubt among users. To understand why HTTPS secure sites may not actually be clean, please read our K7 Labs blog on the subject.
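One simple way to spot a possible typosquatting domain such as the ‘Goggle.com’ example is to measure how many single-character edits separate it from a well-known domain. The Python sketch below uses the Levenshtein edit distance for this. It is a minimal illustration under stated assumptions, not a description of how K7 detects typosquatting; the watch-list, threshold and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


# Hypothetical watch-list of domains worth protecting.
KNOWN_DOMAINS = ["google.com", "example.com"]


def possible_typosquat(domain, known=KNOWN_DOMAINS, max_distance=2):
    """Return the known domain that `domain` closely imitates, or None."""
    domain = domain.lower()
    for target in known:
        distance = edit_distance(domain, target)
        # Distance 0 is the genuine domain itself, so it is not flagged.
        if 0 < distance <= max_distance:
            return target
    return None


if __name__ == "__main__":
    print(possible_typosquat("Goggle.com"))
    print(possible_typosquat("google.com"))
```

A small distance is only a hint, not proof of fraud: unrelated legitimate domains can also differ by one or two characters, so a match should prompt a closer look rather than an automatic verdict.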

Let us now delve deeper into the sub-categories returned for the broad Malware category of URLs.

Unknown: Indicates that the site’s content offered too few features to make a decision on the nature of the site. It could be a newly registered domain or a domain which is just a few days old, parked and waiting to be populated. Threat actors frequently register domains to host their malicious content. Although this sub-category is high-risk relative to the other defined categories, a hacked domain could pose an even higher risk, as users might not know that a website they know and trust has been compromised. We discuss hacked domains later in this blog.

Shopping: If this is an online shopping portal that has been compromised, attackers could gain easy access to your financial transaction information when you make a purchase.

Social Networking/Download Sites/Streaming Media/Internet Storage: These types of websites are commonly used to host malware by threat actors who abuse the website’s “cloud” content hosting features.

Financial Services: Could be compromised sites associated with Banking, Insurance and the like or could be sites designed to lure unsuspecting would-be customers of “financial services”.

Adult Sites: Pornography sites frequently host both malware and adware. In addition, many adults could be coaxed into revealing their confidential and personal information which can later be used to victimise them.

Games: Children could be easy targets if malware/adware were deployed on a supposed gaming site. Some malware authors may be focusing on children, so an interesting and attractive game is an easy lure.

Health Medicine: These could either be dubious sites offering various “remedies” or, once again, compromised legitimate sites in the Healthcare sector. This sector has become a prime target mainly because of the sensitive personal information it holds.

Hacked Domains

Just imagine what would happen if your organization’s web server were compromised. We shall look at the consequences, but first let us see what a web server is.

In layman’s terms a web server is a computer hosting one or more websites. So this could imply that all of the organization’s websites are on the same web server, at least in the case of SMEs. So what do you think would be the aftermath of a security breach of such an organization’s web server? The potential fallout could involve:

Loss of credibility and data
Distrust among customers and vendors
Compensation payable to affected customers, among other costs

Trends

The Financial Services and Healthcare sectors have been among the most impacted during COVID-19. Let us see the impact on them after the shift to the Work from Home model, which might also become the “new normal” for us going forward.

Impact to Financial Sector

According to data from Carbon Black Inc., a unit of VMware Inc. that offers cybersecurity technology to financial institutions, attacks against the financial sector increased 238% globally from the beginning of February 2020 to the end of April 2020. If this was the impact at the start of the shift to a new model of work, we can imagine the threat level right now, which is likely to be many times higher.

Impact to Healthcare Sector

The Coronavirus fear factor has fuelled fraudulent email scams that impersonate legitimate organizations, such as the World Health Organization, and either deliver malware or extract money from hapless victims by other means.

The use of fake domain names is common to some of the threats faced by both the Financial Services and Healthcare sectors.

Phishing

Phishing websites often appear to visitors as an exact replica of a legitimate site. Visitors are tricked into providing their credentials or other sensitive information, such as credit card data, to the threat actors. Social engineering is used as a lure to fool users into clicking the URLs.

In this case, most of the time, users click on the URL willingly, without realising that they have taken the bait and fallen into the trap laid by the threat actors. Users can also land on these websites through typosquatting, as explained earlier.

The categorization is similar to that of “Malware”; the difference lies in the method used to lure and trap users, as discussed above. The very high proportion of the “Unknown” sub-category would imply that most phishing URLs are hosted on domains owned by threat actors, where the content behind the URL may offer too few features to classify it into a known category.

Conclusion

The Web is being used as a quick and easy way to deliver malware and harass victims. There is therefore a need not only to differentiate between malicious and clean URLs but also to categorize them, as a way of restricting access to certain types of websites. Organizations are advised to blacklist such categories and, if this is not possible, to check the website thoroughly and allow access only if it is clean and appropriate. Users are also requested to do their part by being more aware of such malicious links. We at K7 protect our users by sifting through a large number of URLs on a daily basis and categorizing them accordingly. Enterprise users are advised to install a reputed security product such as K7 Endpoint Security and keep it updated to stay safe from cyber threats. Consumer users can install K7 Ultimate Security and keep it up to date to make use of K7’s WebCat feature.

A few additional precautionary measures that all users can take to protect themselves from these fraudulent and dangerous sites are listed below:

Check the URL for misspellings
Ensure that the sites you access are secure, i.e. that the URL starts with “https://” and that there is a padlock symbol in the address bar. Please note that the ‘s’ in “https://” does not guarantee safety; this can only be used as a basic check
Ensure that all the sites that you traverse from the landing page are also secure
Check the site content for grammatical errors and the like
Keep yourself aware and vigilant in order to avoid falling prey to social engineering attacks
Back up critical data
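The “https://” check in the list above can also be automated as a first filter. The Python sketch below uses the standard library’s `urllib.parse.urlsplit`; the function name is hypothetical, and, as noted above, passing this check does not mean a site is safe.

```python
from urllib.parse import urlsplit


def basic_url_warnings(url):
    """Return a list of basic warnings for a URL (an empty list is NOT proof of safety)."""
    parts = urlsplit(url)
    warnings = []
    if parts.scheme.lower() != "https":
        warnings.append("not served over HTTPS")
    if not parts.hostname:
        warnings.append("no hostname found in URL")
    return warnings


if __name__ == "__main__":
    for candidate in ("https://example.com/login", "http://example.com/login"):
        print(candidate, basic_url_warnings(candidate))
```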
