Saturday, September 23, 2023

How to find duplicate words in Java String? [Solution Example]

Problem: Write a Java program to print the duplicate words from a given statement. For example, if the given String is "Java and JavaScript are totally different, JavaScript follows Java", then your program should print "Java" and "JavaScript" because those are the two duplicate words in the given String. You need to consider all cases, e.g. the given String can be null or empty, and it may or may not contain any duplicate words. For simplicity, you can assume that the sentence will always be in English and only use ASCII letters and numerals, with no special characters.

It's better to get the requirements of the problem right at the beginning, even if the interviewer doesn't tell you everything. Jumping directly into a solution without asking a couple of questions may not go well with interviewers who look for detail-oriented candidates.

If you are practicing these coding problems for an interview, I also suggest you take a look at the Cracking the Coding Interview book. It contains 150 Programming Questions and their Solutions, which is good enough to clear most of the beginner and intermediate programming job interviews.


Solution: In order to find duplicate words, we first need to divide the sentence into words. For that, you can split the String on space using a greedy regular expression, so that it can handle multiple white spaces between words. 

You can use the split() method of the java.lang.String class to do that. This method returns an array of words.
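As a minimal sketch of this splitting step, the snippet below splits a sentence on runs of whitespace using the "\\s+" pattern. The class name SplitWordsDemo and the helper splitWords() are hypothetical names used only for illustration, and the call to trim() is an added assumption that avoids an empty first token when the input starts with spaces.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class SplitWordsDemo {

    // Splits a sentence on one or more whitespace characters.
    // trim() is an added assumption: without it, leading spaces
    // would produce an empty first token.
    static String[] splitWords(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = splitWords("  Java   and JavaScript  ");
        System.out.println(Arrays.toString(words)); // [Java, and, JavaScript]
    }
}
```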

Once we have the list of words, we can insert them into a HashSet. Since HashSet doesn't allow duplicates and its add() method returns false if an object already exists in the HashSet, we can find all duplicate words.

Just loop over the array, insert each word into the HashSet using the add() method, and check the return value of add(). If add() returns false then the word is a duplicate, so print it to the console.

This is also one of the top 20 String-based problems from interviews. You can see that article for more coding problems based upon String.

One of the follow-up questions of this is how do you find the number of times each duplicate word has appeared in a sentence? For example, in our coding problem, your solution should also print the count of both Java and JavaScript e.g. Java: 2 and JavaScript: 2 because they have appeared twice in a sentence.



You can solve this problem by choosing another hash-based data structure like a hash table, which maintains key-value pair. Java provides several implementations of hash table data structures e.g. HashMap, Hashtable, and ConcurrentHashMap, but for general purposes, HashMap is good enough.

In short, just use HashMap instead of HashSet to keep the count of duplicate words in the sentence. This is also similar to the problem of finding duplicate characters in String. Instead of character, you need to find duplicate words, as shown here.
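The counting idea can be sketched roughly as follows. This is not the original author's code: the class DuplicateWordCount and the method duplicateCounts() are hypothetical names, and LinkedHashMap (a HashMap subclass) is an assumption chosen here only so that the printed order matches the order in which words first appear.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class; a sketch of counting duplicate words with a hash map.
class DuplicateWordCount {

    // Returns every word that occurs more than once, mapped to its count.
    static Map<String, Integer> duplicateCounts(String input) {
        // LinkedHashMap is an assumption: it keeps first-seen order for printing.
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (input == null || input.trim().isEmpty()) {
            return counts;
        }
        for (String word : input.trim().split("\\s+")) {
            // merge() inserts 1 for a new word, otherwise adds 1 to the old count
            counts.merge(word, 1, Integer::sum);
        }
        // keep only the words that appeared at least twice
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        String text = "Java and JavaScript are totally different, JavaScript follows Java";
        System.out.println(duplicateCounts(text)); // {Java=2, JavaScript=2}
    }
}
```

If ordering does not matter, a plain HashMap works the same way; only the iteration order of the result changes.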

Another follow-up question related to this problem is how do you remove duplicate words from a String in Java? This is actually the same problem as removing duplicate elements from an array. If you know how to solve that, you can easily solve this one as well. If you face any problem, see this solution.
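One possible way to approach the removal follow-up is sketched below. It is not the linked solution: the class RemoveDuplicateWords and the method removeDuplicates() are hypothetical names, and using LinkedHashSet to keep the first occurrence of each word in its original position is an assumption about the desired output.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class; a sketch of removing repeated words from a sentence.
class RemoveDuplicateWords {

    // Keeps the first occurrence of each word and drops later repeats.
    static String removeDuplicates(String input) {
        if (input == null || input.trim().isEmpty()) {
            return "";
        }
        // LinkedHashSet ignores repeats and remembers insertion order
        Set<String> unique = new LinkedHashSet<>();
        for (String word : input.trim().split("\\s+")) {
            unique.add(word);
        }
        // rebuild the sentence with single spaces between words
        return String.join(" ", unique);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("one two one three two")); // one two three
    }
}
```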



Java Program to find duplicate words in String - Example

Here is our solution to the problem of finding duplicate words in a sentence in Java. I have used HashSet to find duplicates. The time complexity of this solution is O(n), where n is the number of words, because we need to iterate over all elements in the array.

You also need a buffer of the same size as the original array, hence, the space complexity is also O(n), so it may not be suitable for a really long String. You need more memory to find even a single duplicate word if your String is huge.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Java Program to demonstrate how to find duplicate words in String.
 */
public class DuplicateWordsInString{

    public static void main(String[] args) {
        String test = "This sentence contains two words, one and two";
        Set<String> duplicates = duplicateWords(test);
        System.out.println("input : " + test);
        System.out.println("output : " + duplicates);
    }


    /**
     * Method to find duplicate words in a Sentence or String
     * @param input String 
     * @return set of duplicate words
     */
    public static Set<String> duplicateWords(String input){
        
        // null or empty input has no words, hence no duplicates
        if(input == null || input.isEmpty()){
            return Collections.emptySet();
        }
        Set<String> duplicates = new HashSet<>();
        
        // split on one or more whitespace characters
        String[] words = input.split("\\s+");
        Set<String> set = new HashSet<>();
        
        for(String word : words){
            // add() returns false if the word was already seen
            if(!set.add(word)){
                duplicates.add(word);
            }
        }
        return duplicates;
    }
    
    
}

Output :
input : This sentence contains two words, one and two
output : [two]

From the output, it's clear that our program is working as expected. It correctly prints that "two" is the only duplicate word in the given String. Nonetheless, we are going to write some unit tests to further test our solution for different input values.



JUnit tests to find duplicate words in Java String

Here is my JUnit test class for our solution. We are going to test our solution for an empty String, a null String, a String with only duplicates, a String without any duplicates, and a String which contains multiple spaces between words. Each JUnit test checks one input. If your input set is large, you can also consider using a parameterized JUnit test.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.Collections;
import java.util.Set;

import org.junit.Test;

public class DuplicateWordsInStringTest {    
   
    @Test
    public void testWithEmptyString(){        
        Set<String> output = DuplicateWordsInString.duplicateWords("");
        assertEquals(Collections.emptySet(), output);
    }
    
    @Test
    public void testWithNullString(){
        Set<String> output = DuplicateWordsInString.duplicateWords(null);
        assertEquals(Collections.emptySet(), output);
    }
    
    @Test
    public void testWithDuplicateString(){
        Set<String> output = DuplicateWordsInString
                                     .duplicateWords("one one one two two");
        assertTrue(output.contains("one"));
        assertTrue(output.contains("two"));
        assertEquals(2, output.size());
    }
    
    @Test
    public void testWithOutDuplicates(){
        Set<String> output = DuplicateWordsInString
                                      .duplicateWords("one two three");
        assertEquals(Collections.emptySet(), output);
    }
    
    @Test
    public void testWithMultipleSpaceBetweenWord(){
        Set<String> output = DuplicateWordsInString
                                     .duplicateWords(" one   two    three ");
        assertEquals(Collections.emptySet(), output);
    }
    
    
}


That's all about how to find duplicate words in a given String in Java. We have used the HashSet data structure to solve this problem, and our solution has time and space complexity of O(n). For a curious developer: can you come up with a solution with better time or space complexity? How about a solution with time complexity on the order of O(k), where k is the number of duplicate words, or O(log N)?

And now let's see a simple quiz for you. What is your favorite sorting algorithm in Java? Quicksort, Merge sort, Heap sort, Selection sort, Insertion sort, Radix sort, or  this one?

1 comment:

  1. This is the same technique anyone can use to find duplicates on array, String or matrix, or any other data structure, just keep storing and Set will tell you which element is duplicate. The real problem comes when Set is not allowed and you are asked to solve the problem using pure Array, in that case you can use set bit pattern to mark the location with 1 to indicate it is used already.
