Tag Archives: Hash table

GSoC 2014 Update : Caching Plugin for Monkey server – Week 2

This the update of the work I did during the my second week of GSoC internship.

What I did this week :

  1. Invested a lot of time in trying to figure out the working of the system call mmap (). I tried out sample programs to map files onto the memory and reading them from the memory.
  2. Once I was quite clear about the functioning of mmap, I then tried to implement mmap in my hash table. I decoupled the current hash table and min heap implementations that I have written.
  3. I then included mapping of files that got inserted into the hash table, and storing the memory location where they got stored, along with the name of the file in the hash table. I also implemented the lookup of a file  – if the file was found in the hashtable, I read the file from the particular memory location that has been stored besides its name (that is the memory location to which it was mmapped while newly inserting it).
  4. After testing ‘mmap’ing with hash table structure, I integrated the hash table and min heap again.

Problem faced :

Initially, I faced problem while reading  a file from a particular memory location using mmap, but solved it.

Yet another issue that still exists :

How to persist the hash table for future use ?

I searched about how I can make hash tables persistent, so that the next time I run the program, the table still contains the previously inserted files and their memory locations. I came up with two rough ideas (I do not know if they’re right) :
  1. Write the hash table to a file and can be loaded for later use.
  2. Use of persistent data structures : I read about it at [2] and [3]

Roadmap :

  1. The aim is to now complete hash table persistence.
  2. Write test cases for the hash table.
My learnings :
Some things that I learnt while working on this are roughly drafted below :
  • ‘mmap’ing of files to the memory means that it is just allocated some space on the memory. It is just reserved space on the memory. It can be accessed through a pointer that was returned.
  • Each process has a unique view of the virtual memory. It is given the feeling that it works with a huge amount of contiguous memory locations while in reality this is not true. Whenever a process begins to function, those parts of the memory that the process needs are loaded into the RAM.
  • Advantage of mmap – It is extremely useful when multiple processes access the same file from memory. Changes made to the file by one process can be kept protected or shared with another process with the help of flags passed as parameters to mmap, such as MAP_SHARED.
  • Usage of fprintf (FILE *stream, const *char format, …) and write (int fd, char *format, int length, …) and some standard file pointers like stdin, stdout, stderr, etc., defined for fprintf () in the standard library – stdio.h and some standard file descriptors such as STDOUT_FILENO, STDIN_FILENO, STDERR_FILENO, etc., defined for write () in the library – unistd.h.
  • Read about and used fstat – an interface call that gets the status of a file in linux. Below is the basic usage of fstat to get the file size.
struct stat fs;
fstat(fd, fs);
int length = fs.st_size;

GSoC 2014 Update : Caching Plugin for Monkey server – Week 1

This is my first week’s update on my GSoC Project – Developing a caching plugin for Monkey server.

During the community bonding period, I had started coding some of the basic data structures that would be required for the caching plugin, such as the hash table for lookup of a resource and a min heap for the deletion of the file that has been used the least number of times. Since these are independent structures from the code, I decided to get started with these. I had done a very simple implementation of these.
Work done this week :
The designed and coded  are the functions that I added :
  1. Hash table : 
    1. create_ht
    2. insert_ht
    3. lookup
    4. delete_ht
  2. Min heap : 
    1. insert
    2. pop
During the first week, I improvised on my code and also added a makefile for easy compilation by referring to [3].
I then concentrated on reading up more about linux system calls that would enable memory copying of files in the cache. My mentor, Eduardo suggested that I try using mmap () for this purpose. mmap is basically a linux system call, an interface, that can be used to map files onto the virtual memory of the process that calls it. You can read more about it at [1] and [2]. I tried out mapping files using mmap function.
Problem faced :  
I had some problems with the makefile and also the delete function of hash table. However I fixed both these problems. You can view my code at [4].
Work for next week : 
I figured out that I would also need to copy the hash table also onto the memory. So I am going to be working on that for the next week. I will also improvise more on my code.
Links : 

Selected for Google Summer of Code 2014.

Recently, my proposal for Google Summer of Code 2014 in Monkey HTTP Server got selected. I have posted my GSoC proposal here.

Advanced Caching plugin for Monkey server

Organization: Monkey Project

Short description: This proposal provides a brief description of the implementation of a caching plugin for Monkey server. This plugin will be of great benefit to the server since it will reduce the server load and increase the performance of the server by enabling to handle requests from more clients. More has been described in the proposal.

Table of Contents:

  1. Short Description
  2. Deliverables
  3. Workflow
  4. Implementation details
  5. Cache Algorithm
  6. Benefits to the project
  7. Timeline
  8. About myself

1. Short Description:

In conditions of high traffic, where the servers are loaded with a large number of requests, it sometimes happens that a server is requested for the same information repeatedly by clients, that too, often within a very short span of time. This clearly wastes the performance of the server since it could be tending to the requests of many other clients.

As a solution to this problem, the idea proposes a caching plugin for Monkey server. A cache is a collection of locally stored data that are often requested by the client or user. Since a cache is physically more close to the requester, information from the cache can be accessed more quickly than from the server itself. With the help of the cache, information that was recently requested is stored in the cache. The performance of the server, to satisfy the requests of multiple clients can be improved.

2. Deliverables:

 By the completion of this internship, the following will have been delivered to Monkey project.  

  1. A cache plugin for Monkey server.

  2. A GUI for the plugin, for the admin to keep track of how the performance of the server by keeping track of the number of successful attempts in fetching information from the cache and number of failed attempts.

3. Workflow:

  1. As the server starts running, it loads all the plugins that have been enabled by the user. The cache plugin will be an optional plugin – the user has the option to enable or disable it while configuring the server, as is the case with other plugins.
  2. As the requests arrive, they are redirected to threads that are least loaded (that have the least number of active connections) by means of asynchronous sockets that are used by the server.
  3. At this point, the caching plugin intercepts if it is a GET request and with the name of the resource, it looks up at cache, instead of fetching data from the server. This step could have the following two alternatives:
    1. On looking up, if it finds the entry of the resource, it gets the cache memory location of the resource.
      1. The plugin then caches and retrieves the contents of the file from the cache memory.
    2. If it does not find the entry, then it fetches the resource from the disk.
      1. While returning it to the client, it checks the configuration file for the cache.
      2. If the file satisfies all the rules that have been mentioned in the configuration file, for the cache, then it stores a copy of the file in the cache, along with the time of storage. It then returns the resource to the client.
Here is a diagram of how the cache works fits in the entire system : 
 
 
 

4. Implementation Details

4. 1. The two main tasks involved in this project are the following:

  1. The cache plugin: In order to find out if the resource is present in the cache or not, a hash table is maintained, that stores the name of the resource and the memory reference where it is located in the cache. A hash table has been chosen because, a lookup operation in a hash tables takes only O (1) time as opposed to greater time by other data structures. However, this decision is subject to changes and in case a more appealing solution turns up, it may be adopted. But this does not create any major changes in the implementation. Only the memory location of the resource needs to be returned.
    1. Handle only get requests – URL parsing – using functions in src/mk_http.c. The various functions that have been defined in src/mk_http.c help in finding out if the method is GET or not.
    2. After that instead of calling mk_http_init () in src/mk_http.c a checking function is encountered that checks if the plugin has been enabled or not.
    3. If the plugin has been enabled, lookup (key) the function for hash table lookup is called, passing as parameters to it, the name of the resource.
    4. In the table if the resource is found, the function returns the memory location where the resource is located in cache.
      1. Next, the retrieve (memory_location), another function of the plugin, will retrieve the file from the cache, with the help of the memory location reference that is passed to it.
      2. The contents of the file is then written to through the socket connection that is open with the client.
    5. In the hash table if the function is not found, then the mk_http_init () function is called, that normally retrieves the content of the file requested, from the disk of the server.
      1. After it checks the validity of the file, (for example, if the file exists, and if it is accessible, etc. ) it is retrieved from the server accordingly.
      2. The configuration file is checked for rules which are to be obeyed, while storing the file in the cache. If the file satisfies all rules, it is stored in the cache.
  2. A front end with Twitter bootstrap and Angular JS: For the convenience of the admin, the plugin will also have a simple graphical user interface. The plugin will have a front end developed with the help of Twitter bootstrap and Angular JS. Since HTTP callbacks are used, the output obtained could be specified to be in JSON format. It will help the admin keep track of the following: 
      1. Number of hits and misses.
      2. How much area of the cache has been occupied.
      3. Option to clear the cache.
      4. Live displays statistics of hits and misses so far.
4. 2. Files included are :
  1. Hash table: This file would contain the following functions among others:
    1. Insert (resource_name) – This function helps insert values into the hash table.
    2. Lookup (key) – This function is for searching if a particular key exists in a hash table. The key here, is the resource and path of the resource combined together.
    3. Remove () – This function checks the duration of time that each file has been in the cache for and removes any such file that has exceeded the time limit mentioned in the configuration file.
  2. Configuration file : This file contains the rules for files that can be stored in the cache. It can be customized according to the needs to the user. Examples for some of the rules are: 
    1. Limit of size of files that can be stored in the cache.
    2. Types of files that should be stored in the cache.
  3. Cache Plugin: This would contain the following functions among others:
    1. Retrieve (memory_location) – This function caches into memory and locating the file. Note that each time a file is retrieved successfully from the memory, the number of hits is incremented.
    2. Write (buffer) – Writing the contents of the file on the socket connected with the client.
    3. Store_in_cache (file) – This function checks the rules in the configuration file, and if the file satisfies the conditions, then it is stored in the cache and the memory location where it was stored, is returned.
4. 3. Issues to be handled during implementation:
 
No race condition while accessing the cache by different threads. Therefore, it should be made sure that no other thread accesses the cache while it is being written by another thread.
 

 5. Cache Algorithm

 Solution 1 : Implementation using min heap.

 The first solution is to use a min heap to store the files and in a cache. The nodes of the heap is used to store files. Each node a.k.a. file is inserted into the min heap using the access count which indicates how many times the file has been accessed. Each time a file is looked up for, using the hash table, the access count is incremented by 1. The heap is then balanced accordingly. The node at the top of the min heap will have the minimum access count. When the cache becomes full, the node at the top (the root node) is deleted, the new resource could be inserted, and the heap is again balanced accordingly.

Run time calculation:
The time complexity for operations of a min heap is as follows :

Insertion : O (log n)
Deletion : O (log n)
Lookup : O (log n)

Time complexity for operations of a hash table = O (1)
Total run time of that LFU cache : Time complexity of lookup in hash table + Time complexity of operations in min heap.

Since the run time for hash table operations is constant, the run time of the cache entirely depends on the time complexity of operations of the min heap, which is, O (log n).

Solution 2 : Implementation using two doubly linked lists.

For the implementation I would like to propose the idea of using doubly linked lists so that the run time of the cache could be optimized to O (1). The doubly linked list called a frequency list, will have nodes that represent different frequencies. Each node in the list is a head to another doubly linked list that contains all files that have the same access number.
So essentially when the first request comes, the hash table is empty and so is the cache. Thus it is retrieved from the server and before returning to the client, it stores a copy of this in the cache. This file is inserted in the linked list of frequencies, inside the linked list of frequency 1. If another file is requested for by the client, it is inserted in the cache after retrieving from the server. In the cache, it is inserted inside the linked list for frequency 1 files. Thus the insertion time is O(1). If suppose the file that was first inserted, was requested for again by the client, the memory location would be present in the hash table. Its access number is incremented to 2. It checks if there is a node for frequency 2. If no, it creates a node for frequency 2 immediately after 1 and then inserts the file in linked list for frequency 2. Thus the time complexity for accessing a file is O(1). If the cache becomes full, the element that has the least access number is to be removed from the linked list. This would any file in the linked list for frequency 1. So the deletion time is also O(1).
The following diagram will give you clear picture of the implementation proposal.
 
 
Run time complexity calculation:
The time complexity for the operations of a doubly linked lists:
Insertion : O (1)
Deletion : O (1)
Lookup : O (1)Time complexity for operations of a hash table = O (1)
Total run time of the LRU cache : Time complexity of hash table + Time complexity of the doubly linked list.Since the hash table operations take only constant time, it should not affect the run time of the cache. Thus the runtime would be entirely dependent on the runtime of the linked lists, which would be O(1).

6. Benefits:

The main goal of Monkey server is to provide a high performance with low usage of resource. By cutting down on the requests for same resource, that the server has to deal with, it makes way for better performance as the server can accommodate other requests. The following would be the benefits that the plugin will provide to the server:
  1. Client’s benefit : Faster access to data and reduced time lag between requesting for information and receiving it.
  2. Reduced load for the server: Fetching same data multiple times can be avoided thereby saving the energy of the server.
  3. Increased performance of the server: It can satisfy the requests come from many clients.

7. Timeline:

Now – April 21st

  • Study about the internal structure of Monkey server.
  • Submit quality patches so that I understand the structure better.

April 21st – May 19th

  • Continue studying about the structure of the server.
  • Investigate more into how to handle plugin data and come up with a solution.
  • Community Bonding Period: During this period, I would like to take some solid work like documenting some specific part of the server, so that it strengthens my knowledge about the internals of the server.

May 19th – May 25th

  • Discuss with the mentor about weekly plans.
  • Analyze, compare and contrast the advantages of using various data structures for lookup purpose.
  • Decide on whether or not to change the decision on using a hash table for lookup.
  • Decide on whether to use a min heap or doubly linked lists for the purpose of implementation of an LRU cache.
  • Discuss with the mentor about how to improve the performance of the plugin.

May 26th – June 1st

  • Finish writing the configuration file that contains functions that check if a file is eligible to be stored in a cache.

June 2nd – June 8th

  • Read documentation for implementation of the decided lookup structure (as of now, hash table).
  • Implementation of the data structure lookup, along with insert, lookup and delete functions.
  • Simultaneously test if it works alone as a standalone structure.

June 9th – June 15th

  • Implement the cache with the solution that has been decided upon during the initial phase.
  • Write the cache plugin with functions to retrieve the file from cache, store the file that is retrieved from the server, in the cache, etc..

June 16th – June 22nd

  • Complete the implementation of functions for the cache. 
  • Implement handling of data that is requested by plugins from various servers.
  • Test integration with hash table.

June 23rd – June 29th

  • Continue integration with Monkey server: Testing and trying for various incoming requests, if the cache works fine.
  • Constantly try to optimize the performance of the plugin.
  • Get code written till date, ready for midterm evaluation.

Deliverable at the end of this period: The cache plugin would be ready by the end of midterm evaluation period.

June 30th – July 6th

  • Buffer period for completing any incomplete work in the functioning of the plugin.
  • This is also the buffer time for changing anything that was found inappropriate previously, or anything that was found to bring down the performance of the server.

July 7th – July 17th

  • Read the documentation for developing user interface with twitter bootstrap.
  • Implement a basic GUI using bootstrap.

July 18th – July 28th

  • Improvise by taking comments from the mentor and others in the organization.
  • Integrate with the data that has been stored – number of hits and misses.
  • Integrate with statistics and create live displays, as mentioned in the second part of the implementation details section.

July 29th – August 3rd

  • Testing phase: Do a thorough testing of the graphical user interface and see if the data displayed is correct.
  • Check the working of the entire plugin.

August 4 – August 10th

  • This is the period for cleaning up any code cleaning up.

August 11th – August 17th

  • Complete the remaining documentation and get the code ready for submission.