Basics I – Searching

Search Algorithms:

Given a database, the first thing you would want to do is probably search through its entries to see if a particular entry exists. This is why search algorithms are one of the most basic and widely used algorithms, and one of the first algorithms taught in any class.

Linear Search:

The Facebook user database has nearly 800,000,000 entries. To search for an user, a linear search algorithm would check each entry from the beginning to the end to see if there is a match. In the worst case, if the entry we are searching for is at the end of the list, we might have to go through all 800,000,000 entries. Even if checking each entry takes only .000001 secs on an average, total time we would need to search for the entry in the worst case is: 800,000,000 * .000001 = 800 secs or 13.33 mins. Imagine waiting for 13 minutes just to check if your friend is on Facebook!

The advantage of linear search apart from its simplicity is that it can be used irrespective of whether the database is sorted or not.

The disadvantage is obviously that it is extremely time-consuming when we are working with large databases.

Binary Search:

Binary Search drastically reduces the worst-case time required to search through a database. However, it can be used only when the database is sorted i.e. when the entries are arranged either in ascending or descending order.

The Binary Search algorithm:

bool BinarySearch(A, left, right, x):

// A is the array(database) of numbers

// x is the number to search for

// left is the leftmost index(starting index) of the working array

// right is the rightmost index(ending index) of the working array

if (left == right):

if (A[left] == x):

return TRUE;

else:

return FALSE;

else:

int mid = floor[(left+right)/2];

if (A[mid] == x):

return TRUE;

else if (A[mid] < x):

BinarySearch(A, mid+1, n, x);

else if (A[mid] > x):

BinarySearch(A, 1, mid-1, x);

The key to understanding the algorithm is that, since the array is sorted in ascending order, if the middle value is less than x, then x(if it exists) will obviously be among the values to the right of the middle value in the array. And if the middle value is greater than x, then x(if it exists) will be among the values to the left of the middle value. Accordingly, the algorithm recurses on the left half of the array A[1…mid-1] or the right half A[mid+1…n].

The if(left == right)  part checks to see if we have boiled down to only one element. In that case, if that element equals ‘x’ then the algorithm returns TRUE, else we can be sure that ‘x’ does not exist in the array(database) and the algorithm returns FALSE.

Example:

Consider the small database of numbers:

12 43 9 32 89 2 6 22 11 (no. of elements, n = 9 )

Sorting the database using an appropriate sorting algorithm (see <insert link> for Sorting Algorithms),

2 6 9 11 12 22 32 43 89 ( n = 9 )

The algorithm then works as shown in the image below:


Binary Search Example

Time Complexity:

As mentioned previously, Binary Search works a lot faster than the conventional linear search. In fact if we look at the Facebook user database example as discussed previously, Binary search on such a database will require comparisons of the order of only log(800,000,000) ~ 30. So if checking each entry takes .000001 secs as before, time required using Binary Search would be 30 * .000001 = .00003 secs.

Recommended Links:
Advertisements

Solving the Caesar Shift cipher

Caesar Shift is one of the oldest means of encrypted communication. It is a type of substitution cipher. The technique is as follows:

Given a message: CAESAR and a key: 12, we shift each letter of the message by key mod 26 (key%26). Key values should be between 0 – 25 (0 meaning no shift). However, we use mod 26 to accommodate key values greater than 26, so that the mod function maps it back to the range 0 – 25. The shift can be positive or negative, meaning for each letter in the message, we can replace it with a letter that comes ‘key = 12’ places after it or before it respectively.

Positive Caesar shift on message: CAESAR with key: 12:

C → O

A → M

E → Q

S → E

A → M

R → D

So the encoded message is: OMQEMD

Notice that while shifting letters, whenever we reach Z, we circle around from A, as in the case of S and R above.

Similarly, negative Caesar shift would give:

C → Q

A → O

E → S

S → G

A → O

R → F

So the encoded message is: QOSGOF

Notice that a positive encryption key ‘k’ corresponds to a negative encryption key of 26 – k.

In our example above, using a positive key of 12 and a negative key of (26 – 12) = 14 would result in the same encoding: OMQEMD.

There are many variations to the Caesar cipher: In one of them, the keys vary with each letter. The variation of the keys may be represented by a function, like f(key) = (key + 1), where the key increases with each letter.

Here are a few sources on the Caesar cipher:

http://en.wikipedia.org/wiki/Caesar_cipher

http://www.simonsingh.net/The_Black_Chamber/caesar.html

http://www.cs.trincoll.edu/~crypto/historical/caesar.html

The Idea:

We have an encoded message and we do not know the key. We can find either the positive key or the negative key, as the two are related. We also need a dictionary source (a text file consisting of all words in the English language) to help in the decoding process.

The program works as follows:

→ Try all keys from 1 to 25

→ After decoding with a particular key ‘k’ check the dictionary file for the decoded word.

If it exists, it is a legitimate word. Output the word and key ‘k’ and continue.

Else continue without any output.

For large words and sentences, the procedure can be quite time-consuming. Most of the time is used by the program in dictionary lookup.

A faster solution would be to remove the dictionary lookup altogether and let the program output the decoded word for each key k from 1 – 25. The output words can then be checked by the user for legitimacy.

Source Code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

FILE *WORDLIST;
int flag = 0;

int main()
{

	int len;
	int key, i;
	char word[50];

	clock_t start, end;

	WORDLIST = fopen("dictionary.txt", "r");

	system("clear");
	printf("Enter word (in small letters):\n>> ");
	scanf("%s", word);
	len=strlen(word);
	len++;

	char *buffer = malloc(sizeof(100));

	for(key=1; key<26; key++)
	{
		unsigned char temp[len]; 	/* Unsigned is necessary else program fails when value of temp reaches 128 */
		strcpy(temp, word);

		for(i=0; i<len-1; i++)
		{
			temp[i]=word[i]+key;
			if(temp[i]>=123) 	/* If word[i] crosses the character 'z' */
			{
				temp[i]=97+temp[i]-123; /* then replace it with corresponding character from beginning of alphabet */
			}
		}

		while(fscanf(WORDLIST, "%s", buffer) != EOF)
		{
			if(strcmp(buffer, temp) == 0)
			{
				flag = 1;
				printf("Match: %s\n", buffer);
			}
		}

		rewind(WORDLIST);

	}

if(flag == 0) printf("\nNo matches found.\n");

fclose(WORDLIST);

return 0;

}

Ab Initio – A coin problem

This is a problem courtesy of CodeChef.com. The original problem can be found at http://www.codechef.com/problems/COINS. Here is the problem, put simply:

A coin of value ‘n’ can be exchanged for three coins: one of value n/2, one of value n/3 and one of value n/4. All fractions are rounded down (floor). By doing these exchanges, what is the maximum value that can be obtained?

There are many ways to solve this problem:

  • Naive Recursive/Divide and Conquer strategy:
Here are a few sources on the Recursion and the Divide and Conquer paradigm:

The Idea:

Consider n=42. Exchanging n, we get 21, 14, 10, which sum up to 45. But we can do better. A run of the source code below with n = 42, gives the maximum obtainable value as 46. This is when we breakdown 42 as follows:

42

(21, 14, 10)

( (10, 7, 5), 14, 10)

We break down 21 because 10+7+5 = 22 > 21. But we do not break down 14 and 10 because they do not yield any better total value.

Source Code:

#include <stdio.h>

int maxVal(int x);

/* This function returns the maximum total coin value obtainable from a single coin of value 'x' Say we have a coin of value 12. Then according to the problem statement, we can exchange the coin for three coins: 6, 4, 3 which have a sum of 13. So maximum value obtainable from a coin of value 12 is max{12, 13} = 13 */

int main()
{
	int n;
	printf("Enter n:\n");
	scanf("%d", &n);
	printf("Maximum value: %d\n", maxVal(n));

return 0;
}

int maxVal(int x)
{
	if(x==0)
		return 0;
	else if(x==1)
		return 1;
	else
	{
		int temp;
		temp = maxVal(x/2) + maxVal(x/3) + maxVal(x/4);		/* temp holds the total value of the coins we get in exchange for x namely x/2, x/3, x/4. This includes any higher or better amount that we might get by exchanging these coins (x/2, x/3, x/4) themselves*/

		if(x > temp) // Check for max(x, temp) and accordingly return a value
			return x;
		else
			return temp;
	}
}

Problems with this approach:

Can be extremely time-consuming for large values of n. Can take time between minutes to hours for n ~ 10^10 and above.
This is because recursion is an usually an expensive approach in problem solving when the problem size is large and when they have a large number of recurring sub-problems. This problem is even more serious in cases where the same sub-problem may need to be solved repeatedly to be able to solve larger sub-problems. The following website explains recursion in detail and also talks about when and how recursion can be expensive:

http://pages.cs.wisc.edu/~vernon/cs367/notes/6.RECURSION.html

This is where dynamic progamming comes up.

  • Dynamic Programming Approach:

The problem can be solved in much lesser time if we use Dynamic Programming. In short, Dynamic Programming(DP) is a widely used programming paradigm in which a problem is first broken down into subproblems, such that the larger problems can be related to its subproblems and solutions to these subproblems ultimately end up solving the larger problems. This is quite similar to the Divide and Conquer approach. But in DP, we use additional techniques to optimize the time complexity of our algorithms. Here are a few websites that deal with Dynamic Programming:

The Idea:

Source Code:

#include <stdio.h>

int main()
{
	int n;
	scanf("%d", &n);
	long long int i;
	long long int max_val[n+1];

	max_val[0] = 0;
	max_val[1] = 1;

	for(i=2; i<=n; i++)
	{
		if(i > max_val[i/2] + max_val[i/3] + max_val[i/4])
			max_val[i] = i;
		else
			max_val[i] = max_val[i/2] + max_val[i/3] + max_val[i/4];
	}

printf("Maximum value obtainable: %lld\n", max_val[n]);

return 0;
}